I have no idea what this thing does. Can someone explain in simple terms what it does?
My organisation is currently investigating installing Spark on the theory that it connects to databases and we need analytics. As far as I can tell it breaks analytics work into parallel workloads.
Differential dataflow lets you write code such that the resulting programs are incremental. For example, if you were computing the most retweeted tweet in all of Twitter and 5 minutes later 1000 new tweets showed up, it would only take work proportional to the 1000 new tweets to update the results. It wouldn't need to redo the computation across all tweets.
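To make "work proportional to the new tweets" concrete, here is a minimal Python sketch of the idea. It is not Differential's API; `RetweetLeaderboard` and `apply_batch` are hypothetical names, and the sketch only handles newly arriving retweets, not retractions.

```python
from collections import Counter


class RetweetLeaderboard:
    """Hypothetical helper: keeps per-tweet retweet counts and the current leader."""

    def __init__(self):
        self.counts = Counter()
        self.leader = None  # (tweet_id, count), or None before any data arrives

    def apply_batch(self, retweeted_ids):
        """Fold in a batch of new retweets; the loop runs once per new retweet."""
        for tweet_id in retweeted_ids:
            self.counts[tweet_id] += 1
            count = self.counts[tweet_id]
            # Only a tweet touched by this batch can overtake the leader.
            if self.leader is None or count > self.leader[1]:
                self.leader = (tweet_id, count)
        return self.leader


board = RetweetLeaderboard()
board.apply_batch(["a", "b", "a"])      # existing history
leader = board.apply_batch(["b", "b"])  # later arrivals only touch "b"
print(leader)
```

The second call never looks at tweet "a" again, which is the property being described. Supporting deletions (un-retweets) would need more bookkeeping than this sketch carries.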
Unlike every other similar framework I know of, Differential can also do this for programs with loops / recursion, which makes it possible to write a much wider range of algorithms.
Beyond that, as you've noted it parallelizes work nicely.
I wrote a blog post that was meant to explain "what does Differential do" and "when it is or isn't useful" and give some concrete examples that might be helpful. https://materialize.com/life-in-differential-dataflow/
Are all the retweet counts of every tweet stored in memory?
Where are the previous counts stored that get merged with the new stream of tweets?
Would it be possible to improve the efficiency of the total calculation by re-using the pre-computed "a + b" if only "c" changes?
Differential dataflow is one way to do that, but only really applies if you have lots of data with complex calculations. For analytics, maybe the "a + b" calculation would cover your last 5 years of operations, and then when a new day's worth of data comes in, you just compute the changes to the totals, rather than re-computing the analytics for all those years, all without manually having to write distinct "total" and "update" code.
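A hand-written version of "compute the changes to the totals" might look like the sketch below. This is exactly the kind of separate update code that the approach is meant to spare you from writing; `RunningTotal` and `apply_changes` are made-up names for illustration.

```python
class RunningTotal:
    """Hypothetical helper: a total that is updated from changes, not recomputed."""

    def __init__(self, history):
        # The expensive part (the "a + b" over years of data) is done once.
        self.total = sum(history)

    def apply_changes(self, changes):
        """changes: iterable of (old_value, new_value); use 0 as old_value for a new record."""
        for old, new in changes:
            self.total += new - old
        return self.total


totals = RunningTotal([10, 20, 30])             # stands in for five years of data
after_new_day = totals.apply_changes([(0, 5)])  # a new record arrives
after_fix = totals.apply_changes([(20, 25)])    # an old record is corrected
print(after_new_day, after_fix)
```

For a plain sum this is easy. The bookkeeping gets harder once joins, group-bys, or iteration sit between the input and the total.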
Edit: lots of downvotes, yet no replies? Can someone explain why my comment is apparently so terrible...?
But you're right, at its core this kind of problem will use techniques like memoization, and most projects that need something like this will have ad hoc approaches to solve this problem. The advantage of differential dataflow is that it's a generalized approach to this problem. The business logic behind these workflows, between tracking dependencies and updates, can get pretty damn complicated and difficult to maintain. Having a generalized approach would make building these dataflows much simpler.
The paper it's based on is pretty skimmable, so I recommend taking a look at it.
I think I may be a bit environmentally damaged from mainly using Clojure. Algorithms similar to this are fairly common in the Clojure ecosystem. Memoization is part of the standard library too.
The claim (in the article) that no one cares about Differential Dataflow seems to be only true when talking about this specific library. The general concept surely translates to some combination of simple concepts like memoization, topological sorting, partial application, etc., so it's obvious to me that many ad hoc implementations would exist, tailored to more specific needs in different programming languages with different feature sets. Sometimes buying into a framework is a lot more work than rolling your own, especially if it means having to switch to a different programming language.
I don't really want to pull rank here, but for the benefit of other readers: 100% nope.
I personally find the "make toxic comments to draw folks out" rhetorical style frustrating, so I'll just leave you with a video from Clojure/conj about how nice it would be to be able to use DD from Clojure, to get a proper reactive Datomic experience.
It's puzzling to me why you _wouldn't_ want to "pull rank", as you say. I did not pretend to be an expert in this domain. I'm really just exposing my knowledge and speculating about why people apparently aren't using this framework, which is what the damn submission is about. Did you even read it?
It seems like I managed to piss off a bunch of users of the framework, who - rather than simply explain in clear terms why I'm supposedly wrong - instead just downvote away and make passive-aggressive comments that assume I'm some sort of troll.
Remind me to never engage with the Rust community again. Jfc.
Edit: Oh, so you're the creator of the framework? If you go straight to calling people toxic when they have questions about it, I think I understand why no one wants to use it.
That being said, rest assured that your experience says absolutely nothing about the wider Rust community. It's one of the most helpful ones I've engaged with.
So please don't judge it by one strangely toxic framework creator.
It's very common to see people with toxic attitudes in and around the Rust community, even in their internal communication about how to use Rust (`actix-web`, anyone?). I don't think it's helpful to lie to yourself about the Rust community like this.
The only thing Rust users who don't want to have these conversations can do is to openly recognize and talk about the extreme fanaticism Rust users commonly display and the toxic pattern of communication that sometimes is bundled or separate, when it comes to priorities in software dev.
Seems like many people here aren't actually willing to engage in a discussion. I guess this submission is basically just native advertisement for the framework in question.
It's much better to have the update logic handled in a thoroughly tested library, and build your application logic on top.
I'm not sure that differential dataflow is that simple as I haven't checked the paper nor the repository—according to other replies, it isn't—but if it were, that's all the more reason to use a library/tool instead of reimplementing something for the thousandth time, I think.
Can you give examples of real world programming tasks that don't fit this definition?
> basic application
> simple concepts
Topological sorting (along with basic graph theory) is something any beginner's course on discrete mathematics should already have taught you. You would reach for that in most cases where you need to deal with a graph of dependencies.
The other part of the puzzle is about storing calculations, i.e. memoization. This is trivial in my language of choice, but really it's not a hard problem to solve in any language. You map function inputs to outputs somewhere in memory and retrieve the results when needed.
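For illustration, the ad hoc combination of the two might look like this in Python (3.9+ for `graphlib`). This is a sketch of the roll-your-own approach being described, not of how Differential Dataflow works internally; `rules`, `evaluate`, and `dirty` are hypothetical names.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: node -> (dependencies, function of their values).
rules = {
    "ab": (("a", "b"), lambda a, b: a + b),
    "total": (("ab", "c"), lambda ab, c: ab + c),
}


def evaluate(inputs, rules, cache, dirty):
    """Recompute only nodes downstream of `dirty`; return (values, recomputed names)."""
    graph = {name: set(deps) for name, (deps, _) in rules.items()}
    values = dict(inputs)
    stale = set(dirty)
    recomputed = []
    # static_order() yields every node after all of its dependencies.
    for name in TopologicalSorter(graph).static_order():
        if name in inputs:
            continue
        deps, fn = rules[name]
        if name in cache and not stale.intersection(deps):
            values[name] = cache[name]  # memoized result is still valid
        else:
            values[name] = cache[name] = fn(*(values[d] for d in deps))
            stale.add(name)
            recomputed.append(name)
    return values, recomputed


inputs = {"a": 1, "b": 2, "c": 3}
cache = {}
values, first_run = evaluate(inputs, rules, cache, dirty=set(inputs))
inputs["c"] = 10
values, second_run = evaluate(inputs, rules, cache, dirty={"c"})
print(first_run, second_run, values["total"])
```

On the second run only `total` is recomputed and the cached `ab` is reused. Note that the caller has to say which inputs changed, and each node is recomputed in full when it is stale.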
These techniques are broadly applicable in many domains. To many people, the intuition would be to reach for them directly when they have a task that looks like a graph search, rather than go look for a framework or library that they would then need to read the documentation for and spend time integrating with their code. Sometimes less is more.
My point is really just that a lot more people would think about this as a graph or memoization problem than would ever think to go look for a Rust framework. Maybe if their own solution at some point doesn't work out, they will start searching for frameworks or libraries.
If everything exists at 7am, and/or you don't need the freshest computed values, this is not the solution you need.
If data A is ready between 2-4am, data B at noon, and data C sometime between 8am-6pm, this allows you to abstract that uncertainty into code, then let the system solve it on a daily basis.
This is not a problem everyone has. But it is a problem most people working with inventory or events have! And it's usually a problem people feeding things to ML have.
and also this: http://muratbuffalo.blogspot.com/2017/11/on-dataflow-systems...
Timely dataflow was inspired by Naiad.
SELECT SUM(A) FROM MyTable
Something similar is implemented in these libraries which however rely on a different data processing conception (alternative to map-reduce):
https://github.com/asavinov/prosto - Functions matter! No join-groupby, No map-reduce.
https://github.com/asavinov/lambdo - Feature engineering and machine learning: together at last!
My personal favorite is "Functional hybrid modelling" - https://github.com/giorgidze/Hydra
I think the biggest issue is the documentation, not so much on writing code, but on building an actual production service using it. I think most of us can now grok that you have a Kafka Stream on one end and a datastore on the other, and the quintessential map/reduce hello world is WordCount.java. That isn't clear from the differential dataflow documentation - I remember thinking how are they getting data from the outside world into this thing, then thinking maybe I don't understand this project at all.
Consider the example in the ReadMe - the hello world is "counting degrees in a graph". While it gives you an idea of how simple it is to express that computation, it isn't interactive - it's unclear how one might change the input parameters (or if that's even possible). The hardest part of most of these frameworks is glue - but once you have that running then exploring what's possible is much easier. Differential Dataflow doesn't provide that for me right off the bat.
That said - I'm not surprised, when I last checked it out Rust Kafka drivers weren't all there and it seemed to be evolving parallel to everything else. I think what would make it more popular is a mental translation of common Spark tasks (like WordCount) to differential dataflow.
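As a rough idea of what that mental translation could look like, here is WordCount written over changes instead of over a fixed dataset. It is plain Python, not Differential Dataflow code; `word_count_deltas` is a hypothetical function, and representing input as `(line, diff)` pairs with `+1` for "add" and `-1` for "retract" is an assumption made for the sketch.

```python
from collections import Counter


def word_count_deltas(counts, line_changes):
    """Apply (line, diff) changes and return output changes as (word, old, new)."""
    touched = Counter()
    for line, diff in line_changes:
        for word in line.split():
            touched[word] += diff
    output = []
    for word, delta in touched.items():
        if delta == 0:
            continue  # additions and retractions cancelled out
        old = counts[word]
        new = old + delta
        counts[word] = new
        if new == 0:
            del counts[word]
        output.append((word, old, new))
    return output


counts = Counter()
word_count_deltas(counts, [("to be or not to be", +1)])
changes = word_count_deltas(counts, [("to be", -1), ("to do", +1)])
print(changes)
```

The second call reports changes only for "be" and "do", because the retraction and the addition of "to" cancel out. The output is a stream of changes rather than a full recount.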
Yeah, the readme is pretty dense on terminology that is unfamiliar (at least to me).
It answers that question like this:
> In the examples above, we can add to and remove from edges, dynamically altering the graph, and get immediate feedback on how the results change
but it would be great to show an example of that in code, as otherwise it is easy to assume that "reachable" is a fixed result set, when the whole point of the system is that presumably you can subscribe to changes in "reachable" as "roots" or "edges" change.
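Something like the following is what I would hope to see, at least at the interface level. This Python sketch recomputes "reachable" from scratch after each edit and then diffs the results, which is precisely the work an incremental system is supposed to avoid; it only shows the shape of the output you would subscribe to. `reachable` and `diff` are hypothetical helpers, not the library's API.

```python
def reachable(roots, edges):
    """All nodes reachable from `roots` by following directed (src, dst) edges."""
    seen = set(roots)
    frontier = list(roots)
    while frontier:
        node = frontier.pop()
        for src, dst in edges:
            if src == node and dst not in seen:
                seen.add(dst)
                frontier.append(dst)
    return seen


def diff(before, after):
    """The change notification a subscriber would receive."""
    return {"added": after - before, "removed": before - after}


roots = {1}
edges = {(1, 2), (2, 3), (4, 5)}
initial = reachable(roots, edges)

edges.add((3, 4))  # connect the two components
after_add = reachable(roots, edges)
change_on_add = diff(initial, after_add)

edges.remove((2, 3))  # cut the path again
after_remove = reachable(roots, edges)
change_on_remove = diff(after_add, after_remove)
print(change_on_add, change_on_remove)
```

Adding one edge makes nodes 4 and 5 appear, and removing one edge makes 3, 4 and 5 disappear, so "reachable" is clearly not a fixed result set.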
Spark is a "product" -- it has extensive documentation, supports multiple languages, and generally is production-grade. It is full of "nice-to-haves", like interactive status monitors, helper functions, serialization layers, and so on. It integrates with existing technologies (Hadoop, K8S, HDFS, etc..). It has this "finished product" feel.
"differential dataflow" seems to be a library. It only supports a single language. The documentation is very basic (it was not even clear if there is a way to run this on multiple machines or not). It is very bare-bones -- there are only a few dozen functions, and no resource monitors or interactive shells. It does not seem to integrate with anything. It has a "research software" feel -- there are random directories in the top-level repo, academic papers, and so on.
(it would probably be more fair to compare "materialize" to Spark...)
Having spent much of last year pointing Kafka at differential dataflow and watching Kafka fall over or fail to start, I definitely feel like I would trust differential dataflow more. But I agree that nothing about the presentation of the project gives that impression.
For example, Spark makes it clear that its performance comes from exclusively in-memory compute across a cluster.
Spark may also be "good enough" from a performance standpoint for many use cases. New tools can get adoption among small players only if they are radically easier to use and deploy.
I once needed to set up a webapp written in Python. I did this by running the code in a WSGI instance and exposing it via nginx. Setting up all the activation files, locked-down permissions, and secure sockets was pretty finicky to get right, and took a non-trivial amount of time.
It would have been much easier to use Python's built-in web server and expose it to the internet directly. It has fewer moving parts and is generally more predictable.
I still went with the more complex solution -- because I needed logging, security and large file offload. And I used the built-in web server for development.
The requirements of a "production" system are pretty different from those of a "development" one. Sometimes people are willing to install a bigger (and therefore more fragile) system when they need more features.
I say this as a huge fan of Kafka, but things like MySQL have better defaults and are easier to get running out of the box, and there's no reason Kafka's starting experience couldn't be the same if someone cared enough to put the time and effort in. And ultimately it's a shame, because it leads people to ignore something that's a much better model and platform in the long term.
There's always a perverse incentive when the software provider's model is to monetize through support, consulting, and offering the software as a managed service. If it's too easy to run, then why would one pay for any of these services?
100% worth it IMO, but it's a lot of upfront cost and you only start to see the benefits when a given flow is Kafka end-to-end and you learn how to use it, so I absolutely get why people are skeptical.
Compute and storage capabilities keep growing rapidly. If one structures their data well and uses a reasonable query processor and some form of underlying columnar storage, then calculations on TBs of data can be completed in seconds at low cost.
Being able to recompute the world from scratch is a p0 requirement for most analytic workloads, as otherwise migrations and unforeseen computation changes due to new product requirements and other activities become painful.
This leaves differential techniques in an awkward spot where to be effective they need to
1) Operate on vast quantities of data or sufficiently complex calculations such that optimization of compute is a concern to the end-user.
2) Operate in a computational environment that is sufficiently constrained such that all present and future changes to the computation can be reasonably accounted for.
3) Be transparent enough that engineers don't feel that they are duplicating logic.
It's hard to think of applications that would meet these criteria outside of intelligent caching of computation within DB engines.
I agree that cloud has made computing power more accessible, but there are known limits regarding scalability. Moreover, using distributed computing (Spark, etc.) kills efficiency (see all those posts about a laptop beating a medium-sized big data cluster).
In terms of efficiency, the big question is "does an incremental gain in efficiency matter for this application?" If we're talking about a < 10x performance/cost change, the answer will be no for most teams. Consider how many of these big data applications are implemented in Python or other interpreted languages.
Hard to answer, considering....
>It's had very little advertising?
Personally, I've never heard of it and have no idea what differential dataflow is. Maybe I've done it but never gave or discovered a name for it. I don't know what Spark or Kafka Streams are. Maybe because I've never had a use case for those that wasn't satisfied by a tool that was "good enough", or, more likely, I haven't come across anyone recommending those on projects, because they also don't know what those tools are. I would never have known what RabbitMQ was if a coworker hadn't suggested we use it to build queues, and it turned out to be cumbersome to use and 100x more complicated than writing a stored procedure that turned out to be "good enough". Most tools fall into that space where they are marginally better in some regards than "good enough", but not better enough to make up for the learning curve for other developers, changes in maintenance or design, cost, etc. Advertising is pretty general and it's hard to say which of these it's doing wrong, and depending on their market none of these might be wrong for them; it may just be that the potential market is content with "good enough" and has no need to search for tools like this.
>Rust is intimidating?
I'm not sure what the stats on Rust are, but I don't think it's popular enough among business developers that you could point to it as the reason a tool has failed the adoption phase.
In my teens and twenties I collected ideas the way some folk collect stamps. The simple fact of the matter is that there are amazing things out there that you've never heard of, and no one really seems to care.
(As an aside, I hate the question, "If FOO is so great why doesn't everyone use it?" I do not know. That's not my department.)
These days some of these things are better known and some even have Wikipedia pages and stuff ( E.g. https://en.wikipedia.org/wiki/Vaneless_ion_wind_generator ) but a lot of others are still obscure (trawl through Rex Research if you want to look for weird tech.)
Like, there's a mechanism that can absorb kinetic energy. The demo has a little car on rails with a ramp at one end and a wall at the other. They put a wineglass at the wall and they put the car on the ramp and let it go: the glass shatters. They activate the device and repeat: car hits glass and halts, glass does not shatter. Messed up, right? They're from Poland IIRC, they've been doing demos at trade shows. I bet you've never heard of them. (Bug me and I'll try to dig up a link; They're in Rex Research.)
I already mentioned the "Vaneless" ion wind generator, an efficient solid-state device for converting wind into electric power without e.g. killing birds with spinning vanes. Cheap, simple, easy, durable, been around for decades, and you just now heard about it, eh? :)
There's a battery that desalinizes salt water. A nuclear reactor made of molten salt. Balloons stronger than steel. There's a guy in Michigan, Wally Wallington, who figured out how to move monoliths single-handedly, they walk just like the old stories say!
Anyway, I'm getting ranty here. To veer back on topic: yeah, it's a bummer, you build an awesome mousetrap and even the people with lots of mice ignore it. I wish I knew what to tell you. Maybe paint it mauve?
The vaneless ion generator is >5x less efficient.
Molten Salt Reactors, on the surface, sound quite safe, but nuclear power's primary problem is economics, and this is an unproven technology operating over the long time scales needed to amortize reactor construction costs. Nuclear power systems present novel failure mechanisms that don't exist in everyday technology, such as the corrosion mechanisms caused by coupling of mechanical, thermal, chemical, and radiological stresses, and MSRs present a new set of these that are poorly understood. Additionally, MSRs exhibit thermal shocks in normal operations orders of magnitude above what may be produced in a water-cooled reactor.
It's just not that simple - it looks simple to you.
Going back to the original article "Why isn't differential dataflow more popular?" we could ask "Why aren't Molten Salt reactors more popular?"
Biodegradable plastic has been a thing for over half a century yet the whole planet is choking on discarded plastic. Where's the sane reason for that?
Cars aren't sane, we had to convince ourselves that they were through a deliberate campaign of domestic propaganda! Cars kill hella people (I don't have stats for the rest of the world but more Americans have been killed by cars than have died in all the wars we've fought!) And that's just the people that get hit. There's air pollution (car exhaust is a deadly poison) and tires are constantly wearing down and giving off vulcanized rubber particles. I won't deny their convenience, but you'll never convince me they're sane.
Or my old pet peeve: refrigerators. They open like cabinets instead of like drawers. The cold air goes out, the warm air goes in, and you get to pay for the energy to cool it off. Why not make them like a chest of drawers?
Sanity? I think not.
 "The Real Reason Jaywalking Is A Crime" (Adam Ruins Everything)
Replying here to the other fella:
> The vaneless ion generator is >5x less efficient.
C'mon, less efficient by what metrics? Total cost over the lifetime of the structure, including maintenance? Killing eagles?
In re: Molten Salt reactors, there's an old page about Molten Salt tech in general that talks about reactors towards the end:
> The second MSR was a civilian power plant prototype, the Molten Salt Reactor Experiment (MSRE)7. Hugely successful, it was ignored by the US Atomic Energy Commission (US AEC), which had decided to favor the Liquid Metal Fast Breeder Reactor (LMFBR). The Director of ORNL, Dr. Alvin Weinberg, pushed for the MSR, but was fired for his efforts 8.
> 8. Pages 198 - 200, "The First Nuclear Era : The Life and Times of a Technological Fixer", by Alvin Martin Weinberg (1994).
I've read a longer history of Molten Salt reactors that described the events referenced in the above quote. Evidently both kinds of reactor showed promise early on (MS and LMFB) but for bureaucratic reasons the one was passed over for the other.
This one has an easy answer. If you want to isolate each of the drawers so that opening one does not result in the warming of the others, then you'll need a separate mechanism for blowing cold air into each drawer. (Or worse, a separate cooling coil for each drawer.) These are expensive mechanical components prone to failure, and each of these mechanisms takes up valuable space in the fridge. They may provide some long term amortized savings over the energy cost if done correctly, but it's not at all obvious to me that would be the case without seeing the math worked out.
Sanity? I think not.
Unless you live in the north pole and get outside temps in the 3°C range, this idea is already a no-go. Refrigerators are pretty efficient, using the same energy as a single incandescent lamp. They are not fighting the internal house temperature at all - compressors/heat pumps actually get less efficient at colder temperatures, and the heat it sheds off is offsetting your heating costs.
The average upright fridge at Home Depot (USA) consumes 500 kWh-ish a year. This is to cool down from an internal ambient of 20C. It absolutely takes more work to cool from 20C than it does from 10, 5 or 0C. This is the conservation of energy. Changes in cooler _efficiency_ as it approaches the setpoint do not mean that I can have free energy by having a larger delta.
Modern refrigerators have been designed specifically (forced by the government) because they are always-on devices that in aggregate consume large amounts of energy. You are conflating low power and always-on consumption of energy.
I'll absolutely engage with you, but it needs to be over facts and mathematics and not hand-wavy arguments about "pretty efficient" vs an incandescent lamp.
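For reference, the two figures in this exchange can be put on the same scale with a couple of lines of arithmetic. The 500 kWh/year number is the one quoted above; the 60 W rating for the lamp is an assumption, since no wattage was given.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def average_watts(kwh_per_year):
    """Average continuous power draw implied by a yearly energy figure."""
    return kwh_per_year * 1000 / HOURS_PER_YEAR


def kwh_per_year(watts):
    """Yearly energy used by a device drawing `watts` continuously."""
    return watts * HOURS_PER_YEAR / 1000


fridge_watts = average_watts(500)  # the 500 kWh/year fridge figure
lamp_kwh = kwh_per_year(60)        # ASSUMPTION: a 60 W bulb left on all year
print(round(fridge_watts, 1), lamp_kwh)
```

Under that assumption the fridge averages roughly 57 W, and a 60 W bulb left on all year uses about 526 kWh. This says nothing about how the consumption would change with a different ambient temperature, which is the actual disagreement.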
Part of it is habits, which are not easy to change. And the fact that when they are changed, it’s often incrementally, by substituting them with something similar enough.
Thanks for the reminder of the amazing things that are out there, but just haven’t cracked the mainstream. I think about it a lot with software from the 90s. Things like HyperCard. Stuff that, despite lower processor rates and memory, seemed MORE sophisticated than a lot of what’s standard today.
And yes! HyperCard is a great example. Really picking on the IT industry is like taking candy from a baby, we're so woefully ignorant of our own prior art.
Yes, it's called braking.
News articles, links to more videos, patents, etc.: http://www.rexresearch.com/lagiewka/lagiewka.htm
"A small Fiat 126p, going 45 km per hour, was driven into a concrete wall. The bumper was not damaged. The driver wore no seatbelts. The inertial reaction, which should have thrown him onto the hood, did not occur. The stopping distance was only 16 centimetres. Impossible? Yet hundreds of people in, and the stadium, and millions more on television. The use in all vehicles of the absorber of the energy, "Ecollision", can radically improve automobile safety."
Regardless of the mechanism of energy absorption, if a vehicle is decelerated from 45 kph to 0 within 16 centimeters, there needs to be a way to remove the kinetic energy from the driver inside as well (seatbelts). If not, they will simply continue flying along at 45 kph, through the windshield. If the driver in this case was not wearing seatbelts, it's quite likely that the vehicle was not going that quickly.
``Lagiewka says, "The technical idea behind my buffer can be used in very many practical solutions. Another invention which I showed experimentally, is the brake. Connected to the axis on a Mercedes, the car stopped in one-quarter of the distance usually required".''
When braking, the limiting factor is usually not the brake mechanism itself (e.g. the disc brakes), but the interface between tires and the road. That's why cars have anti-lock systems.
It's possible. I certainly haven't tried to replicate the mechanism.
In the video you can see the driver does experience "the inertial reaction" when the car halts, at least it seems to me that he leans forward due to inertia at that moment. I think that description is just bad. (Or, of course, it was just a weird hoax.)
I just watched the video again with the playback speed at 1/4 (thanks Youtube!) and at 0:46 ( https://youtu.be/z-h56N_A3rY?t=45 ) when the car hits the bumper I can clearly see the driver lean forward due to inertia. Of course you still need to wear a seatbelt, this isn't magic. (You know on Star Trek how the Klingon disruptors make a person disintegrate, clothes and all, yeah? How does the disintegration process know to stop at the floor? What is it about the interface between boots and floor which stops the disintegration? Could you use that to make disintegration-proof armour?) I get what you're saying. I do.
The point is not that the driver magically didn't feel inertia (he did, and you can clearly see it in the video) it's that the car didn't crumple. The kinetic energy given up by the stopping car went into the flywheel rather than into violent deformation of the physical structure of the car. I.e. it works (if it works at all, I don't deny that it might be a hoax) like a "crumple zone" without the crumple:
> Crumple zones are designed to increase the time over which the total force from the change in momentum is applied to an occupant ...
He clearly didn't experience 50g deceleration, he would have been sent flying through the windshield.
What's more likely is that the car actually did take a lot more than 16cm to slow down to 0 m/s or that it wasn't going at 45km/h. I would bet on the second, the car looks like it's going at maybe 20km/h in the video.
> The kinetic energy given up by the stopping car went into the flywheel rather than into violent deformation of the physical structure of the car.
That's an interesting idea but if it ends up being less effective at protecting the humans inside I don't think most people will choose it over normal crumple zones.
Of course. Who cares if you can just reset the flywheel (as opposed to scrapping the crumpled car) if you still have to scrape the people off of the dashboard?
My whole point is that there are devices like this one that may be more effective, given some R&D, but that get neglected.
Let me put forward another, perhaps less physically controversial, example: the "Rolamite".
> Rolamite is a technology for very low friction bearings developed by Sandia National Laboratories in the 1960s. It is the only elementary machine discovered in the twentieth century and can be used in various ways such as a component in switches, thermostats, valves, pumps, and clutches, among others.
> The Rolamite was invented by Sandia engineer Donald F. Wilkes and was patented on June 24, 1969. It was discovered while Wilkes was working on a miniature device to detect small changes in the inertia of a small mass. After testing an S-shaped metal foil, which he found to be unstable to support surfaces, the engineer inserted rollers into the S-shaped bends of the band, producing a mechanical assembly that has very low friction in one direction and high stiffness transversely. It became known as Rolamite.
Or the Hilsch-Ranque vortex tube:
> The vortex tube, also known as the Ranque-Hilsch vortex tube, is a mechanical device that separates a compressed gas into hot and cold streams. The gas emerging from the "hot" end can reach temperatures of 200 °C (392 °F), and the gas emerging from the "cold end" can reach −50 °C (−58 °F). It has no moving parts.
Now these you can actually buy. They sell little ones that go on the end of an air compressor hose to deliver "spot cold" as it's called. I once emailed a company who makes them to ask what would happen if you set it up so that the cold (or hot) output chilled (or heated) the incoming air, would you get a feedback loop? But they weren't interested.
The vortex tube is less efficient than a heat pump, so there are good reasons not to use it in every potential application, but I feel that there are good applications that go completely unrealized. That's my original point. Just because some cool technology exists doesn't guarantee that it will be used well, or at all.
If you are interested: going from v to 0 (or from 0 to v) with uniform acceleration a over time t, the speed and time are related by v = a t. The distance traveled is s = (1/2) a t^2. Plugging the first into the second gives s = v^2 / (2a). So a = v^2 / (2s) = (10 m/s)^2 / (2 × 0.1 m) = 500 m/s^2 ≈ 50 g.
Interestingly, this follows directly from the definition of acceleration and doesn't use anything like Newton's laws.
Looking at the video, why didn't they just do it in a controlled environment? Some gauges/meter markings and high speed cameras. The time it takes to stop is on the order of t = v/a = (10 m/s) / (500 m/s^2) = 1/50 s, so only about one frame at a normal video frame rate.
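The same estimate can be redone with the figures quoted from the article (45 km/h and 16 cm) instead of the rounded 10 m/s and 0.1 m. This is a small sketch assuming perfectly uniform deceleration; `uniform_deceleration` is just a helper name for the formula a = v^2 / (2s).

```python
G = 9.81  # m/s^2, standard gravity (rounded)


def uniform_deceleration(speed_kmh, distance_m):
    """Return (deceleration in m/s^2, stopping time in s) for a uniform stop."""
    v = speed_kmh / 3.6            # km/h -> m/s
    a = v ** 2 / (2 * distance_m)  # from v = a*t and s = a*t^2 / 2
    t = v / a
    return a, t


a, t = uniform_deceleration(45, 0.16)
g_load = a / G
print(round(a), round(g_load, 1), round(t, 4))
```

With those inputs the deceleration comes out just under 490 m/s^2, or a little under 50 g, over roughly 0.026 s, so the rounded numbers give essentially the same answer.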
> why didn't they just do it in a controlled environment?
Well, there is more than one video. The one we're talking about is obviously a public demonstration and not a scientific test. (There was one video, which seems to have been removed now, that showed a very good and clear demo of the ramp/glass being done at some trade show or convention. That video or others might still be on YT somewhere.)
What about that ramp/glass demo? A glass shatters when the little car thingy hits it, and then another glass doesn't shatter when the flywheel device is active.
And really, I haven't dug into this particular tech too deeply, it could well be a hoax.
But my point still stands, there are lots of interesting and useful ideas that work and get ignored or neglected. Magnus effect rotors, the Tesla turbine, desalinizing batteries, "Aircrete", etc...
I could literally go on all day, just listing the less "woo-woo" stuff off of Rex Research.
Since the Rust community is relatively small, language bindings would be very helpful. Being able to configure pipelines from Java or Typescript(!) would be great.
Or maybe it's just that this form of computation is too foreign. By the time you need it, the project is so large that it's too late to redesign it to use it. I'm also unclear on how it would handle changing requirements and recomputing new aggregations over old data. Better docs with more convincing examples would be helpful here. The GitHub page showing counting isn't very compelling.
Personally, I keep an eye out on new technologies, but I'm not likely to embrace them without good reason. A fragmented tech stack is annoying.
This looks an awful lot like Spark to me. And doesn't seem to really solve the problems I typically experience with data engineering. For me, the biggest issue is orchestration. I don't see any facilities here for managing and executing data pipelines.
So, it seems to me that people aren't using dataflow more because it looks a lot like legacy products on the market. And it doesn't solve the massive problem of job orchestration and management. Apache Airflow + Python + BigQuery is immensely powerful and dead simple to use. It's going to be hard to compete with.
The API is too hard to use?
The docs / tutorials are not good enough?
DD falls into an uncanny valley where the API surface is simple enough to grasp quickly yet foreign enough that actually grokking it is pretty hard, let alone applying it in an organization where maintenance is a top concern. To do anything nontrivial, you need knowledge of timely-dataflow too, and the DD documentation doesn't do a good job of integrating knowledge from the TD docs - they're written by someone who has already internalized that knowledge so it's an afterthought. Getting data in and out of the dataflow and orchestrating workers is pretty much undocumented outside of GitHub issue discussions. Trying to abstract away dataflows behind types and functions turns into a big ol' generic mess. There are a lot of rough edges like that (and the abomonation crate is... well... an abomination).
McSherry's blog posts, while tantalizing, are often focused too much on static examples (entire dataset is available upfront) and are too academic-focused to make up for holes in the book. As far as I can tell, the library hasn't seen enough use for best practices to emerge and there's almost no guidance on how to build a real world system with DD.
By far the biggest problem I've had: I can avoid a DD project for a week or two at most before enough knowledge leaves my memory that I have to spend days rereading my own code to get reoriented and productive again. You either use unlabeled tuples which turns the dataflow into an unholy mess or you spend half your time writing and deleting boilerplate when doing R&D. DD is just too weird and the API too awkward - I haven't figured out a method for writing straightforward DD code.
That said, when I have gotten it to work on nontrivial problems, the performance and capabilities have been really impressive. I've just never been able to get the stars to align to use it in a professional context with future maintainers.
I think what DD needs is a LINQ-like composable query language that abstracts away the tuple datatypes and provides an ORM/query builder layer on top of dataflow statements. Most developers are familiar with SQL statements, which would make DD a lot easier to adopt.
I think the main issue is that in most shops the scale of their data isn't so large that recomputing a query with new data takes long enough for them to want to put in the engineering effort to switch off more common tools like Spark, Airflow and columnar storage DBs. They're also likely, with decent engineering, not yet at a point where they run into tuning issues on their ingest side. An ETL taking an hour every night and then taking a couple of seconds to run that query, or even having that query set up on a job that just sends out a report, isn't really an issue for most small to medium-sized companies, and even at larger ones, if your data throughput isn't particularly high, I don't see people needing to reach for this for the same reasons.
You obviously can do those less intensive tasks in DDF, but it doesn't strongly make a case for itself there, largely because DDF doesn't seem to offer any more benefit on those smaller tasks. Going from 15s to 230ms is a really tremendous leap in performance, but for many companies I doubt the 15s is a bottleneck in the first place, so it's not actually solving a problem there; it would be a nice-to-have.
I'm trying to help a couple (friends?) with getting the analysis of rslint running well on DDlog or at least DDflow, with the end-goal being a perceptually zero-latency linter that typically responds faster than a human types.
We're currently seeing initial delays in the single-digit second range, and that's not even on large projects (the incremental performance is far better, but we would like to out-compete the official TS typechecker even in CI settings that don't keep the linter's state across runs).
The good news: we're making nice progress on profiling tools and I might get to trying some WCOJ code later today.
: https://mtrlz.dev/api/rust/dogsdogsdogs/calculus/index.html (3rd-party hosting of docs for the calculus/dogs^3 crate)
For the use cases I'm envisioning, this strikes me as a nice-to-have, and even then only if the persistence API were sufficiently easy to use (or at least to avoid).
> It's had very little advertising?
I hadn't heard much about it till now.
> Rust is intimidating?
At work I need a killer reason to inflict _any_ language on everyone else. We have a lot of shallow computation graphs (really the same few graphs on different datasets) and a few deep graphs which need incremental updates. The cost of an ad-hoc solution is less than the perceived cost (maybe the real cost) of adopting an additional language.
Broad classes of algorithms will basically expand to being full re-computations with this framework (based on a quick read of the whitepaper), and adopting a tool for efficient incremental updates is less enticing if I'm going to have to manually fiddle with a bunch of update algorithms anyway. E.g., kernel density estimation needs to be designed from the ground up to support incremental updates; a naive translation of those algorithms to dataflow would efficiently update some sub-components like covariance matrices, but you'd still wind up doing O(full_rebuild) work.
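To make the covariance point concrete, here is a minimal sketch of the kind of sub-component that does update cheaply: a Welford-style running covariance over a stream of (x, y) pairs. The class and function names are made up for illustration and have nothing to do with the differential dataflow API; the batch function is only there to check the result.

```python
class RunningCovariance:
    """Online sample covariance for a stream of (x, y) pairs (Welford-style)."""

    def __init__(self):
        self.n = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.c_xy = 0.0  # running sum of (x - mean_x) * (y - mean_y)

    def add(self, x, y):
        # O(1) work per new point, independent of how many points came before.
        self.n += 1
        dx = x - self.mean_x                 # deviation from the OLD mean of x
        self.mean_x += dx / self.n
        self.mean_y += (y - self.mean_y) / self.n
        self.c_xy += dx * (y - self.mean_y)  # uses the NEW mean of y

    def covariance(self):
        if self.n < 2:
            raise ValueError("need at least two points")
        return self.c_xy / (self.n - 1)


def batch_covariance(points):
    """Full recomputation over all points, used here only as a reference."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    return sum((x - mx) * (y - my) for x, y in points) / (n - 1)
```

The catch described above still applies: a density estimate evaluated over all stored points does not become cheap just because the covariance (and hence the bandwidth) updates in constant time.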
Like map reduce, most people do not understand how it works and why it is a useful paradigm.
Unlike map reduce, there is not an entire sub-industry of companies offering it as a service, and engineers who have used it for years without contemplating its alternatives. In the absence of this background noise, people assume DD is niche, and even "wrong" or harmful.
We have new people who come in from time to time, who have experience working at a giant MR shop, who spend the first few months wondering aloud why we don't "just use <MR Framework>". They usually come around (if they care to understand how this new system works) or give up (usually because they never understood the MR trade-offs in the first place, but were unwilling to part with its style of thinking/working).
One thing I'll note is our jargon around it is extremely minimal and literal. The diction employed by DD (and TimescaleDB!) feels very formal in comparison, which can be off-putting to prospective users.
I'm not one to advocate for dumbing down your tone (quite the opposite). However, it's interesting to note that the successful-yet-complicated projects (like MR, Kafka) have an accretion disk around them of dumbed-down explanations on Medium, YouTube and the like, that can lure in people who are curious but less academic.
I don't think you can manufacture these. It's just a matter of time, until things like this appear for DD.
Rust took me around three attempts to get into, and it took a motivated project to really seal the deal, but at some point I understood enough that it just became programming again. Haven’t reached that with differential dataflow yet, but I’ll keep trying.
In case you're interested, we're a loose group working on DDlog and DDflow to be able to use them for a JS/TS linter.
There are a couple of fairly concrete and varyingly-isolated tasks on Timely, DDlog, as well as DDflow's dogs^3 (for the latter, boilerplate-encapsulating tooling with documentation for WCOJ).
Let me know if you'd like to talk.
- It's somewhat hard to sell to management. There (was) no company behind it to provide support, and it's not a "successful Apache project" with a large-ish community, either. And generally, for a long while it was a passion project more than something Frank McSherry would actively encourage you to use in production.
- As others have said, the "hello world" is somewhat tricky. Not a lot of people know Rust. If you say "let's do this project in Rust", this will likely not go well; if I were able to use it from .NET and JVM, as a library, it might be an easier sell (I'm personally more invested in .NET now, but earlier in my career it would've been JVM).
- Last but not least: the "productization" story is a bit tricky; comparing it to Spark does it no service. For Spark, not only do I have managed clusters like EMR, but I have a decent amount of tooling to see how the cluster executes (the Spark web UI). Also, I can deploy it on Mesos or YARN, not just standalone (and Mesos/YARN have their own tooling). For differential dataflow, one had none of that (at least last time I checked). Maybe it'd be more fair to compare it to Kafka Streams?
* Might I add: spark-ec2 was a huge help for me picking up Spark, since before the 1.0 version. You can do tons of work on a single machine, yes... but for this kind of system, the very first question is "how do you distribute that?". And you have the story that "it's possible", but you don't have easy examples of "word count, done on 3 machines, not because it's necessary but because we demonstrate how easy it is to distribute the computing across machines".
* Compared to Kafka Streams: the thing about Kafka Streams is that you know what to use it for (Kafka!) and one immediately groks how one uses this in production (all state management is delegated to Kafka, this is truly just a library that helps you work better with Kafka). With differential dataflow, it's much less clear. You could use it with Kafka, but also with Twitter directly, or with something else. And what happens if it crashes? How do you recover from that? What are the data loss risks? Does it give you any guarantees or do you have to manage that?
Batch data processing is very well understood, cheap and getting cheaper every year. So, if you can afford to boil the ocean every night, DD is a tough sell.
The addressable market, i.e. customers with problems which can only be solved with DD (instantaneous, exactly correct answers), is probably small right now.
Materialize (built on differential dataflow) is cool but doesn't have the complete package of a persisted database.
Re: the second point — you’re right, Materialize has historically leveraged existing upstream systems (like Kafka) for things like persistence. But we also hear you loud and clear that not everyone wants to stand up Kafka :)
However, I also think differential dataflow solves a big problem for smaller companies building out their MVP or in early stages. Firebase is popular because it's easy to set up, and its realtime functionality on the client side means you don't need to write a client-side data management layer; you can just use Firebase's realtime functionality.
The issue is that Firebase is completely untyped, isn't relational, and has limited queries. So you end up writing gnarly non-transactional code that makes many round-trip requests to query basic stuff.
I think there may be an opportunity for a product that combines the performance and client-side tools of Firestore, the ease of use of Airtable, and the real-time query and materialized view functionality of Materialize into a database platform for businesses that want to scale their product.
Big ask, obviously, but I know that a product like that would help me launch products much faster. I'd pay a lot for it.
There are the following difficulties in implementing such systems:
o (Small) changes in input have to be incrementally propagated to the output as updates rather than new results. This changes the paradigm of data processing because now any new operator has to be "update-aware"
o Only simple operators can be easily implemented as "update-aware". For more complex operators like aggregation or rolling aggregations, it is frequently not clear how it can be done conceptually (efficiently)
o Differential updates have to be propagated through a graph of operations (topology) which makes the task more difficult.
o Currently popular data processing approaches (SQL or map-reduce) were not designed for such a scenario so some adaptation might be needed
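The first two points can be sketched with a toy "update-aware" count operator. This is a minimal illustration, not how differential dataflow implements it: the `IncrementalCount` name and the `(key, diff)` input format are assumptions made for the example. Input changes arrive as +1 (insert) or -1 (delete), and the operator emits changes to its output instead of a fresh result.

```python
from collections import defaultdict


class IncrementalCount:
    """Keeps per-key counts and emits output CHANGES for each batch of input changes."""

    def __init__(self):
        self.counts = defaultdict(int)

    def apply(self, changes):
        """changes: iterable of (key, diff) with diff = +1 (insert) or -1 (delete).

        Returns a list of ((key, count), diff) output changes.
        """
        # Net change per key within this batch.
        touched = defaultdict(int)
        for key, diff in changes:
            touched[key] += diff

        out = []
        for key, delta in touched.items():
            if delta == 0:
                continue  # inserts and deletes cancelled out, output is unchanged
            old = self.counts[key]
            new = old + delta
            if old != 0:
                out.append(((key, old), -1))  # retract the previous result
            if new != 0:
                out.append(((key, new), +1))  # assert the updated result
                self.counts[key] = new
            else:
                del self.counts[key]
        return out
```

The work done per batch is proportional to the number of keys touched, not to the size of the accumulated input. Counting is the easy case; the harder operators mentioned above need more state than a single number per key.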
Another system where such an approach was implemented, called incremental evaluation, is Lambdo:
Yet, this Python library relies on a different novel data processing paradigm where operations are applied to columns. Mathematically, it uses two types of operations: set operations and function operations, as opposed to traditional approaches based on only set operations.
A new implementation is here:
Yet, currently incremental evaluation is implemented only for simple operations (calculated columns).
o Generally, we do not want to re-compute aggregates - aggregates should also be updated, particularly if n is very large
o In real applications, rolling aggregation is performed using partitioning on some objects. For example, we append new events from many different devices to one table and want to compute rolling aggregates for each individual device. Hence, this (i-n, i+n) will not work anymore.
o Rolling aggregation using absolute time windows will also work differently. Although, if records are ordered (like in stream processing) and there are no partitions, then it is easy.
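As a rough sketch of the partitioned case, the following keeps one window per device and updates the rolling sum in constant time per appended event. It assumes a trailing window of the last n events per device (not the centered (i-n, i+n) window mentioned above), and the class name is invented for the example.

```python
from collections import defaultdict, deque


class PartitionedRollingSum:
    """Rolling sum over the last n events of each device, updated incrementally."""

    def __init__(self, n):
        self.n = n
        self.windows = defaultdict(deque)  # device -> last n values
        self.sums = defaultdict(float)     # device -> current rolling sum

    def append(self, device, value):
        window = self.windows[device]
        window.append(value)
        self.sums[device] += value
        if len(window) > self.n:
            # Subtract the value that falls out of the window instead of re-summing.
            self.sums[device] -= window.popleft()
        return self.sums[device]
```

Events from different devices can be interleaved in one table; each append only touches the state of its own partition.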
"Scalability .. but at what COST?"
Also "Materialize" seems not to support needed features like tumbling windows (yet) when dealing with streaming data in SQL: https://arxiv.org/abs/1905.12133
Additionally, "Materialize" states in their doc: "State is all in totally volatile memory; if materialized dies, so too does all of the data." This is not true, for example, for Apache Flink, which stores its state in systems like RocksDB.
Having SideInputs or seeds is pretty neat; imagine you have two tables of several TiBs or larger. This is also something that "Materialize" currently lacks:
"Streaming sources must receive all of their data from the stream itself; there is no way to “seed” a streaming source with static data."
As for the data persistence, that's something the underlying approach for the aggregations could handle relatively well with LSM trees (back then, `Aggregation` was called `ValueHistory`).
Along with syncing that state to replicated storage, it should not be a big problem to make it recover quickly from a dead node.
I know how long I want to wait, 30 minutes in one of my cases as I know that I've seen 95% of the important data by then. In the streaming world there is _always_ late data so being able to tell what should happen when the rest (5%) arrives is crucial for me.
This differs from use-case to use-case for me and being able to configure this and handling out-of-order data at scale is key for me when selecting a framework for stream processing.
Apache Beam and Apache Flink do this very well.
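For readers who haven't used either, here is a toy sketch of the "wait 30 minutes, then decide what to do with the rest" idea. It is not the Beam or Flink API; the class, its fields, and the simplification that the watermark is just the largest event time seen so far are all assumptions made for illustration.

```python
from collections import defaultdict


class TumblingWindowCounter:
    """Counts events per tumbling window, accepting late events for a grace period."""

    def __init__(self, window_size, allowed_lateness):
        self.window_size = window_size
        self.allowed_lateness = allowed_lateness
        self.watermark = 0                    # simplification: max event time seen
        self.open_windows = defaultdict(int)  # window start -> running count
        self.closed = {}                      # window start -> final count
        self.dropped = []                     # events that arrived too late

    def _expired(self, start):
        return start + self.window_size + self.allowed_lateness <= self.watermark

    def on_event(self, event_time):
        start = event_time - event_time % self.window_size
        if start in self.closed or self._expired(start):
            # Too late: here we just record it; a real system lets you choose.
            self.dropped.append(event_time)
        else:
            self.open_windows[start] += 1
        self.watermark = max(self.watermark, event_time)
        for s in sorted(self.open_windows):
            if self._expired(s):
                self.closed[s] = self.open_windows.pop(s)
```

With `window_size=60` and `allowed_lateness=30` (minutes), an event for the first hour is still counted if it shows up before the watermark reaches minute 90, and lands in `dropped` afterwards. What to do with `dropped` is exactly the per-use-case choice described above.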
Taken from : Apache Beam has some other approach where you use both and there is some magical timeout and it only works for windows or something and blah blah blah... If any of you all know the details, drop me a note.
It obviously only works when you window your data, as it needs to fit in memory. The event-time and system-time concepts from Beam and Flink are very similar, as is the watermark approach.
Thank you for sharing the links. For me it is now clearer where the difference lies between differential-dataflow and stream-processing frameworks (which also offer SQL and even ACID conformity!). I'm using Beam/Flink in production and missing out on one of these mentioned points is a deal-breaker for me.
What I would like to have is a choice - and Apache Beam for example lets you choose this.
If you think people in some communities would benefit then you should be proactive in advertising there and in particular providing bindings for their favourite languages. This would enhance discoverability. In my corner of the tech and science world, people mostly use Python and/or R but few know about Rust and fewer have knowingly used it.
In general I wonder how many people sit in the intersection of those who are free and willing to base their system on Rust, have dataflow problems to solve, and understand the advantages differential dataflow brings to the table?
Figuring out how to manage memory without constant de/serialization would be tricky, and it's unfortunate that it's so hard to do fusion across FFI, but I'd still expect it to happily outperform e.g. Spark SQL.
> Reflow thus allows scientists and engineers to write straightforward programs and then have them transparently executed in a cloud environment. Programs are automatically parallelized and distributed across multiple machines, and redundant computations (even across runs and users) are eliminated by its memoization cache. Reflow evaluates its programs incrementally: whenever the input data or program changes, only those outputs that depend on the changed data or code are recomputed.
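The memoization idea in that description can be sketched in a few lines. This is only an illustration of caching keyed on both the code version and the input data, not Reflow's actual mechanism; `MemoCache`, `step_version` and the hashing scheme are assumptions made for the example.

```python
import hashlib


class MemoCache:
    """Caches step results keyed on (step version, input data)."""

    def __init__(self):
        self.store = {}
        self.misses = 0  # number of times a step actually had to run

    def run(self, step_version, func, data):
        # The key changes if either the code version or the input changes,
        # so only affected results are recomputed.
        key = hashlib.sha256(f"{step_version}:{data!r}".encode()).hexdigest()
        if key not in self.store:
            self.misses += 1
            self.store[key] = func(data)
        return self.store[key]
```

Note the difference from differential dataflow: a cache like this reruns the whole step when any part of its input changes, whereas DD does work proportional to the change itself.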
I consulted for Motorola many years ago. I remember one of the Sr guys explaining to me their product view. New things needed to be 10x better than existing in order for MOT to be excited or want to invest in a new product, otherwise the switching cost / effort made it too risky that people wouldn’t bother to adopt a new thing.
Jane Street uses Incremental quite heavily in their trading platform.
My guess is that not a lot of people are using Rust to build the kinds of platform where this kind of library would see adoption, yet.
 Search for "Incremental Computability of a Dataset Transformation" https://patents.justia.com/patent/20180196862
I was surprised by how little attention online algorithms received when I first had to implement one.
My conclusion is that processing power currently overcomes the lack of definition or understanding people have about what they're building.
Anyway, now that I know something with that name exists, maybe someday I can learn how it works. Or will have a project where it is important.
DD for me was one of the final attempts to find something, anything, that could handle the requirements I was working with, because Spark, Flink, and others just couldn't reasonably get close to what I was looking for. The closest 2nd place was Apache Flink.
Over the last year I've read through the DD and TD codebases about 5-7 times fully. Even with that, I'm often in a position where I go back to my own applications to see how I had already solved a type of problem. I liken the project to taking someone used to NASCAR and dropping them into a Formula One vehicle. You've seen it work so much faster, and the tech and capabilities are clearly designed for so much more than you can make it do right now.
A few learning examples that I consider funny:
1. I had a graph that was on the order of about 1.2 trillion edges with about 90 million nodes. I was using serde-derived structs for the edge and node structs (not simplified numerical types), which means I have to implement (or derive) a bunch of traits myself. I spent way more time than I'd like to admit trying to get .reduce() to work to remove 'surplus' edges that have already been processed from the graph to shrink the working dataset. Finally, in frustration and reading through the DD codebase again, I 'rediscovered' .consolidate(), which 'just worked', taking the 1.2 trillion edges down to 300 million edges. For instance, some of the edge values I need to work with have histograms for the distributions, and some of the scoring of those histograms is custom. Not usually an issue, except having to figure out how to implement a bunch of the traits has been a significant hurdle.
2. I get to constantly dance between DD's runtime and trying to ergonomically connect the application into the tonic gRPC and tokio interfaces. Luckily I've found a nice pattern where I create my inter-thread communication constructs, then start up 2 Rust threads, and start tokio-based interfaces in one, and the DD runtime and workers in the other. On bigger servers (packet.net has some great gen3 instances) I usually pin tokio to 2-8 cores, and leave the rest of the cores to DD.
3. Almost every new app I start, I run into the gotcha where I want to have a worker that runs only once 'globally' and it's usually the thread that I'd want to use to coordinate data ingestion. Super simple to just have a guard for if worker.index() == 0, but when deep in thought about an upcoming pipeline, it's often forgotten.
4. For diagnostics, there is: https://github.com/TimelyDataflow/diagnostics which has provided much needed insights when things have gotten complex. Usually it's been 'just enough' to point into the right direction, but only once was the output able to point exactly to the issue I was running into.
5. I have really high hopes for materialize.io. That's really the type of system I'd want to use in 80% of the cases I'm using DD right now. I've been following them for about a year now, and the progress is incredible, but my use cases seem more likely to be supported in the 0.8->1.3 roadmap range.
6. I've wanted to have a way to express 'use no more than 250GB of RAM' and have some way to get compile-time feedback that a fixed dataset won't be able to process the pipeline with that many resources. It'd be far better if the system could adjust its internal runtime approach in order to stay within the limits.
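On item 1, for anyone wondering why .consolidate() would shrink a collection that much: as far as I understand it, DD collections are streams of (record, time, diff) updates, and consolidation accumulates the diffs for identical records at the same time and drops whatever cancels out to zero. A rough Python sketch of that idea follows; it is a plain function written for illustration, not the Rust API.

```python
from collections import defaultdict


def consolidate(updates):
    """Sum diffs per (record, time) and drop entries whose net diff is zero.

    updates: iterable of (record, time, diff) triples.
    """
    totals = defaultdict(int)
    for record, time, diff in updates:
        totals[(record, time)] += diff
    return sorted(
        (record, time, diff)
        for (record, time), diff in totals.items()
        if diff != 0
    )
```

An edge that was inserted and later retracted at the same logical time contributes +1 and -1, so it disappears from the consolidated collection entirely, which is presumably how processed 'surplus' edges got removed above.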
If you do, you might be interested in the LSM tree ideas from when arrangements were called `ValueHistory`, to offload part of the memory usage to SSDs.
: https://mtrlz.dev/api/rust/dogsdogsdogs/index.html (3rd party rustdoc for convenience)