Great article, one quibble: there isn’t really a clear dividing line between batch and streaming. If you process data one row at a time, that is clearly a streaming pipeline, but most systems that call themselves streaming actually process data in small batches. From a user perspective, it’s an implementational detail, the only thing you care about is the latency target.
Nearly all data sources, including the changelogs of databases, are polling-based APIs, so you’re getting data from the source in (small) batches. If your goal is to put this data into a data warehouse like Snowflake, or a system like Materialize, the lowest latency thing you can do is just immediately put that data into the destination. I sometimes see people put a message broker like Kafka in the middle of this process, thinking it’s going to imbue the system with some quality of streamyness, but this can only add latency. People are often surprised that we don’t use a message broker at Fivetran, but when you stop and think about it there’s just no benefit in this context.
> If you process data one row at a time, that is clearly a streaming pipeline, but most systems that call themselves streaming actually process data in small batches. From a user perspective, it’s an implementational detail, the only thing you care about is the latency target.
Author here. 100% agreed.
As an aside, I just came across your post about how Databricks is an RDBMS [0]. I recently wrote a similar article from a slightly more abstract perspective [1].
Having worked heavily with RDBMSs in the first part of my career, I feel like so many of the concepts and patterns I learned about there are being re-expressed today with modern, distributed data tooling. And that was part of my inspiration for this post about data pipelines.
https://www.confluent.io/blog/turning-the-database-inside-ou... might be interesting if you haven't seen it already. The way I see it RDBMSes already had most of the tech, but they wrap it up in an opaque black box that can only be used one way, like those automagical frameworks that generate a full webapp from a couple of class definitions but then fall apart as soon as you want to do something slightly specialised. So the data revolution is really about unbundling all the cool RDBMS tech and moving from a shrinkwrapped database product to something more like a library that gives you full control over what happens when, letting you integrate your business logic wherever it makes sense.
> I feel like so many of the concepts and patterns I learned about there are being re-expressed today with modern, distributed data tooling
So much this.
I work on Flow [0] at Estuary [1]; you might be interested in what we're doing. It offers millisecond latency continuous datasets that are _also_ plain JSON files in cloud storage -- a real-time data lake -- which can be declaratively transformed and materialized into other systems (RDBMS, key/value, even pub/sub) using precise, continuous updates. It's the materialized view model you articulate, and Flow is in essence composing data lake-ing with continuous map/combine/reduce to make it happen.
I was asked the other day if Flow "is a database" by someone who only wanted a 2-3 sentence answer, and I waffled badly. The very nature of "databases" are so fuzzy today. They're increasingly unboxed across the Cambrian explosion of storage, query, and compute options now available to us. S3 and friends for primary storage; on-demand MPP compute for query and transformation; wide varieties of RDBMSs, key/value stores, OLAP systems, even pub/sub as specialized indexes for materializations. Flow's goal, in this worldview, is to be a hypervisor and orchestrator for this evolving cloud database. Not a punchy elevator pitch, but there it is.
Strong concur, the idea of "database" just doesn't make sense anymore, I've (d?)evolved to "storage, streaming, and compute" (as it seems you do as well) in all my discussions.
Have you read ThoughtWorks material on the "data mesh"? Sounds like your product is looking to be a part of that kind of new data ecology.
It is in definitely anticlimactic after data warehousing, data lakes, data lakehouses, etc. to just throw up your hands and say "data whenever you want it, wherever you want it, in whatever form you want it" (at whatever price you're willing to pay!) So I feel your pain on marketing your product,but I think the next 5 years or so will be heavily focused on automating data quality and standardized pipelines, computational governance and optimizing workloads, and intelligent "just in time" materializaton, caching, HTAP, etc.
Your big play in my mind is helping customers optimize the (literal financial) cost/benefit tradeoff of all those compute and query engines.
Can somebody explain Data Mesh in simple terms? I've watched numerous videos and read the article, but still can't understand what it's supposed to say.
Is it basically to set up one data lake per analytics team, instead of a big centralized data lake?
It seems to me this is already being done from my experience. Each team has their own s3 bucket in which they store their data products and other teams can consume them using their favorite data engines as long as they can read parquet files from s3.
It's basically a set of 4 principals or best practices:
1. decentralized data ownership.
2. Data as a product.
3. Self-serve data infrastructure.
4. Federated governance.
So I don't think of it as some new product thing but rather principals that work well for people with a certain type of use case. It's a bit like Microservices in that regard.
Interesting project! (The interactive slides are cool btw.)
Could you share a bit about how engineers express data transformations in Flow?
From a quick look at the docs, it doesn't look like you are using SQL to do this, which is interesting since it bucks the general trend in data tooling.
You write pure-function mappers using any imperative language. We're targeting TypeScript initially for $reasons, followed by ergonomic OpenAPI / WASM support, but anything that can support JSON => JSON over HTTP will do.
It's only on you to write mapper functions -- combine & reduce happens automatically, using reduction annotations of the collection's JSON-schema. Think of it as a SQL group-by where aggregate functions have been hoisted to the schema itself.
You can say that again. It's a bet, and I don't know how it will work out. I do think SQL is a poor fit for expressing long-lived workflows that evolve over joined datasets / schemas / transformations, or which are more operational vs analytical in nature, or which want to bring in existing code. Though I _completely_ understand why others have focused on SQL first, and it's definitely not an either / or proposition.
Ultimatelly there will never be a pure streaming system processing, one record at a time, in real world. Any such system contains a busy loop somewhere inside polling the source, say each 100ms, and unless it shares a lock with the source system, it will never guarantee that there won't be more items in the source queue within those 100ms intervals. Therefore all such systems are at best (micro)batch systems. Also streaming systems literally batch data into time windows when doing, eg. group by operation, so they turn into batch systems then.
Pure batch systems are those where the processing window is infinite and no state is preserved. Everything is recomputed from scratch on every run. This seems to be the prefered way to do ETL because dragging state around and accidentally polluting it is better to be avoided if not handled properly.
What is more useful for real world data processing would be an "incremental batch" model, in which the processing system has a memory of what it has processed so far and after comparing that against source data, it would determine what will run in the next update batch.
Sadly, the industry is plagued with either pure streaming solutions, even though most data problems are not of this nature. Or ETL and workflow systems, which are thinking in terms of pure batching model. This results in me having to implement the necessary logic for incremental loads myself while not finding these ETL frameworks very useful.
I've honestly had more luck writing scripts myself than relying on excessively complicated frameworks for ETLing out there. They only seem to convolute stuff together like Ruby on Rails back in the days, instead of separating concerns like some small http library or web microframework.
Is there anything out there on the horizon which focuses on incremental batch processing, or as the article point out, updating materialized views that I manage myself?
How would this look like? Specifically, how do you know if something has been deleted? Do you compare the primary keys in your materialized view (the last snapshot you have of the data) with the source data to know what changed? Isn't that really hard to do if they're not in the same database?
In real life most people prefer taking a full snapshot each day because they don't have good solutions to these problems in batch systems (CDC is another story).
Source data should not experience deletes or updates, otherwise backfills will not work. Deletes can be handled by mirroring source data. Updates are difficult and will need a full CDC system to capture them. Better is to negotiate with data provider to send data updates as appends and never to delete from history.
The whole point of ETL is to bring data from one database to another. The comparison of source and destination primary keys can be done in python outside of db. And should be done on entire partitions instead of individual rows. Eg. you only consider which 'day' partitions have been loaded, not which rows have been loaded.
That kind of approach is fine for special cases like time series or logs or events, but "no updates or deletes" is never going to be true for arbitrary data.
"Negotiating with data provider" is never going to happen - SAP or IBM or whatever vendor of whatever you're integrating is not going to change how their systems work to make your life easier - more likely they would use it as an opportunity to pitch their reporting solution instead.
Meaning from simple data movement, you get need for CDC on source end, then the simple incremental movement, then deduplication on target end - and that one is pretty computationally expensive.
For small data and low refresh frequencies (like singular gigabytes in source size, so hundreds of megabytes in columnar, updated daily), this dance might not be worth it compared to daily full snapshots.
I wish you were right though, my life would be hella easier.
We are probably refering to different scenarios. When purchasing data for analytics, data providers are usually sophisticated enough to know not to modify their data history. With new ones, data delivery format can be negotiated.
Data providers usually wait for a day or something worth of data to collect before validating and releasing it to customers.
For integrating some OLTP database updating in real time on the other hand, yes you will need CDC.
---
Most of data engineering is just incrementally adding new data to existing corpus and then running a big batch job to dedup, sort or partition. This last step surely is computationally expensive, but at least it is conceptually simple and can be solved by throwing hardware at it. The first part of incremental updates is what imo causes more troubles.
I do sadly have the opposite experience - "Yes, we are contractually obligated to give you data about every batch and everything we did with it - what do you mean excels with schema that differs day to day is not enough?"
...
The last step being computationally expensive kinda puts lower bound on latency of streaming replication from CDC, especially if you're trying to do it at scale and can't fine-tune each individual table partitioning.
Boy, how do I wish all data was incremental events.
I'd beg to differ; having a message broker as part of the streaming pipeline allows to set up multiple consumers (e.g. Snowflake and Elasticsearch and something else). Also, depending on the settings, you can replay streams from the beginning.
That's why we see >90% of Debezium users running with Kafka or alternatives. There is a group of point-to-point users (e.g. caching use cases) mostly via Debezium engine, but it's a small minority.
I'd say that the difference was the systems behavior in the presence of IO, and it's pretty important in my experience. Micro-batching systems hold up processing while waiting for IO but proper streaming implementations continue using cpu for elements at other points in the stream, very roughly.
Exactly. You can have the amortization benefits of batching without adding extra latency, if you do optimistic pipelining. It's super powerful if your transaction sizes aren't linear, e.g. because you're doing in-memory aggregations.
Sometime when I am old(er) and (somehow?) have more time, I'd like to jot down a "Rosetta Stone" of which buzzwords map to the same concepts. So often we change our vocabulary every decade without changing what we're really talking about.
Things started out in a scholarly vein, but the rush of commerce hasn't allowed much time to think where we're going. — James Thornton, Considerations in computer design (1963)
> Semantic MediaWiki (SMW) is an extension to MediaWiki that allows for annotating semantic data within wiki pages, thus turning a wiki that incorporates the extension into a semantic wiki. Data that has been encoded can be used in semantic searches, used for aggregation of pages, displayed in formats like maps, calendars and graphs, and exported to the outside world via formats like RDF and CSV.
As someone who’s spent a lot of time working on data pipelines, I think this is a great breakdown of the complexity most data engineers are facing. However, I think there’s two more keys to tidying up messy pipelines in practice:
1. You need to colocate both stream processing for the data pipeline and real-time materialized view serving for the results.
2. You need one paradigm for expressing both of these things.
Let me try to describe a bit why that is.
1. You virtually always need both stream processing and view serving in practice. In the real-world, you ingest data streams from across the company and generally don’t have a say about how the data arrives. Before you can do the sort of materialization the author describes, you need to rearrange things a bit.
2. Building off of (1), if these two aren’t conceptually close, it becomes hard to make the whole system hang together. You still effectively have the same mess—it’s just spread over more components.
This is something we’re working really hard on solving at Confluent. We build ksqlDB (https://ksqldb.io/), an event streaming database over Kafka that:
1. Let’s you write programs that do stream processing and real-time materialized views in one place.
2. Let’s you write all of it in SQL. I see a lot of people on this post longing for bash scripting, and I get it. These frameworks are way too complicated today. But to me, SQL is the ideal medium. It’s both concise and deeply expressive. Way more people are competent with SQL, too.
3. Has built-in support for connecting to external systems. One other, more mundane part of the puzzle is just integrating with other systems. ksqlDB leverages the Kafka Connect ecosystem to plug into 120+ data systems.
As someone who basically sticks everything possible into Postgres, this is interesting! Streaming tools don't automatically cache things you need? I guess it's about time they do! Postgres, for instance, has a robust LRU mechanism that deals with OLTP quite competently. OLAP too if your indices are thought-out.
Also, although built-in materialized views don't allow partial updates in Postgres, you can get a similar thing with normal tables and triggers. Hashrocket discussed that strategy here-- https://hashrocket.com/blog/posts/materialized-view-strategi... .
Of the traditional RDBMSs, I believe Oracle has the most comprehensive support for materialized views, including for incremental refreshes [0].
As early as 2004, developers using Oracle were figuring out how to express complex constraints declaratively (i.e. without using application code or database triggers) by enforcing them on materialized views [1].
It's quite impressive, but this level of sophistication in what materialized views can do and how they are used does not seem to have spread far beyond Oracle.
SQL Server has indexed views, which is "real time" refresh since it's a separate clustered index that's written to at the same time as the base tables.
Do you know how sophisticated SQL Server is about updating indexed views? How granular are the locks when, for example, an indexed aggregate is updated? That will have a big impact on write performance when there are many concurrent updates.
A long time ago, I tried to use SQL Server indexed views to maintain a bank account balances table based on transaction history [0].
I forget what I ended up doing, but I remember that one of the downsides of using indexed views was that they didn't support any constraints. There are many restrictions on what you can and can't put in a SQL Server indexed view [1].
In this regard, I think Oracle has a more mature materialized view offering, though I personally haven't used Oracle much and don't know how well their materialized views work in practice.
Well, indexed views aren't materialized views per se, they are a tradeoff between maintenance and deterministic performance.
A materialized view is nothing more than a snapshot cache, a one time ETL job. So it can abide by any constraints and is completely untethered from the data that created it. So you have to create your own maintenance cycle, including schema validation and any dynamic / non-deterministic aspects of the MV.
An indexed view is modified just like the clustered index of any tablr object upon which it depends, as an affected "partition" of the DML. That's what the SCHEMABINDING keyword is for, binding the view to any DML statements of its underlying base table(s).
So no need to maintain it at all, at the expense of conforming to a fairly rigid set of constraints to ensure that maintenance is ..umm ..maintainable.
In practice most views' logic are perfectly simpatico with the constraints of an indexed view - the tradeoff is write performance vs the "cache hit" of your view.
I do way more OLAP/HTAP engineering in my day job so indexed views are less common vs. Columnstores, but indexed views are a highly underutilized feature of SQL Server.
if sql server can't incrementally maintain your view then it doesn't let you index it. that's why there are so many restrictions but on the flip side if you manage to workaround the restrictions then decent update performance is assured.
As an ex-Oracle DBA, I really do miss native materialized view support in the likes of Postgres. They are /so/ powerful. People love to bash Oracle, but this is a good example of a feature that the open source DBs haven't got close to yet.
Most problems of data engineering of today would be solved in presence of a tool in which I would define arbitrary transformation of a say a single daily data increment and the system would handle the state management and loading of all of the increments. Regardless of if they came from source updates or backfills.
Data engineering really is just a maintenance of incrementally updated materialized views, but no tool out there yet recognizes it.
They at best help you orchestrate and parallelize your ETLs across multiple threads and machines. They become glorified makefiles at the cost of introducing several layers of infrastructure into the picture (eg. Airflow) for what should have been solved by simple bash scripting.
Yet at best these tools only help with stateless batch processing. When it comes down to stateful processing, which is necessary for maintaining an incrementally updated materialized views and idempotent loads, I have to couple the logic of view state management (what has been loaded so far) with logic of the actual data transformation.
Response to difficulties of batch ETL from the industry is usually: batch data processing systems are resource hungry and slow, all you need from now is streaming.
No, actually I don't. For data analytics, pure streaming almost has no application. Data analytics is essentially data compression of big data to something smaller. Ie. some form of group by. I have to wait for a window of data to get close before computing anything useful. Analytics on real "real time" data on unclosed windows is confusing and useless.
So all data analytics will ever run on groups, windows and batches of data. Therefore I need a system which will help me run data transformations on batches. More precisely - stream of smaller batches. I need this to react to incoming daily, hourly or minutely batches and I need this to backfill my materialized view in the case I decide to wipe it off and start again.
You can literally do this in what was supposed to be the original system to orchestrate bunch of programs - shell scripting. And you'll be happier for it than using current complex frameworks. Only things you will miss is something to run distributed cron and to distribute load to multiple machines. At least the latter can be handled by gnu parallel.
This article hits the nail on its head with describing what conceptual model for ETL actually is and once others will follow, we might finally see new frameworks or just libraries to help us to greatly simplify ETLs. Perhaps one day data engineering will be just as simple as running an idempotent bash or python or sql script or even close to nonexistent.
https://www.getdbt.com/ comes extremely close in my eyes and even tackles the documentation and infra-as-code aspect. We went all in half a year ago and never looked back.
DBT is interesting, but is far from what I'm describing.
1. It is only for structured SQL, not for arbitrary data. I can't use it to unpack raw zipped data for example
2. It couples logic for data transformation and view state management. Actually it makes you do it yourself, so it doesn't really help at all. You'll get burned by storing view state together with your data, eg. when a batch increment contains no data.
3. It is not built with "incremental materialized views" in mind. It still thinks in a batch refill mode and incremental mode according to this [0].
It is certaily an improvement over managing sql scripts by hand, but far from the ultimate goal of maintaining materialized views in a declarative way.
How do you do testing out of interest? Whenever I have seen dbt used, it is usually data analysts creating new tables on the fly in data warehouse scenarios.
Maybe I am just too used to application developer workflows where models are defined in code and then there are ORMs and schema migration tools to help manage all that.
I had in mind OLAP use cases in environments with lot of both unstructured and structured tabular data. Some kind of scripting is necessary just to structure bunch of text and jsons into tables.
Ravendb seems like OLTP NoSQL database on the other hand.
Great post! Just heard about this from one of our customers who slacked me with "He is describing Ascend.io!" :-)
Having spent 15+ years writing big data pipelines and building teams who do the same, I couldn't agree more... the conceptual model we're all quite comfortable with is this notion of cascading, materialized views. The challenge, however, is that they are expensive to maintain in a big data pipeline context -- paid either in system resource cost, or developer cost. The only reasonable way to achieve this is a fundamental shift away from imperative pipelines, and to declarative orchestration (a few folks mention this as well). We've seen this in other domains with technologies like React, Terraform, Kubernetes, and more to great success.
so each data pipeline is a pure function ? hmm geez if we had something that was all about pure functions and how they can be used to express real life problems.
Are you thinking of a specific implementation of materialized views? Most implementations from traditional RDBMSs would indeed be too limiting to use as a general data pipeline building block.
The post doesn't argue that, though. It's more about using materialized views as a conceptual model for understanding data pipelines, and hinting at some recent developments that may finally make them more suitable for more widespread use.
From the conclusion:
> The ideas presented in this post are not new. But materialized views never saw widespread adoption as a primary tool for building data pipelines, likely due to their limitations and ties to relational database technologies. Perhaps with this new wave of tools like dbt and Materialize we’ll see materialized views used more heavily as a primary building block in the typical data pipeline.
You can absolutely get complicated data models nailed using views. Some of the views get unwieldly and a bit long but it can be done. The catch is in incrementally loading the aggregate tables or self referencing (even then the underlying views are essentially functions to be included). I scanned the article and have followed materialize.io for a bit and built pipelines that handle what is essentially the awfulness of performantly updating materialized views.
I'm not a master but believe a core piece of truth for data:
Move the data as little as possible
If you can use federated query (cost/performance is acceptable), do so. If you can use materialized views, do so. Data replication has tons of issues, you almost always have the 2 generals problem and reconciliation / recompile procedures. If you don't move the data, the original storage is the source and it is always right.
I think I went way out of responding directly to you, I do think materialize.io and delta lake and declarative pipelines are the solution to 95% of the data problems out there.
I'm speaking conceptually about materialized views as system implementations differ.
Nearly all data sources, including the changelogs of databases, are polling-based APIs, so you’re getting data from the source in (small) batches. If your goal is to put this data into a data warehouse like Snowflake, or a system like Materialize, the lowest latency thing you can do is just immediately put that data into the destination. I sometimes see people put a message broker like Kafka in the middle of this process, thinking it’s going to imbue the system with some quality of streamyness, but this can only add latency. People are often surprised that we don’t use a message broker at Fivetran, but when you stop and think about it there’s just no benefit in this context.