Hacker News new | past | comments | ask | show | jobs | submit login
Materialize Raises a $32M Series B (materialize.com)
201 points by austinbirch 6 months ago | hide | past | favorite | 86 comments

Materialize has tackled the hardest problem in data warehousing, materialized views, which has never really been solved, and built a solution on a completely new architecture. This solution is useful by itself, but I'm also watching eagerly how their road map [1] plays out, as they go back and build out features like persistence and start to look more like a full-fledged data warehouse, but one with the first correct implementation of materialized views.

[1] https://materialize.com/blog-roadmap/

For a primer on materialized views, and one of the key rationales for Materialize's existence, there's no better presentation than Martin Kleppman's "Turning the Database Inside-Out" (2015). (At my company it's required viewing for engineers across our stack, because every data structure is a materialized view no matter where on frontend or backend that data structure lives.)


Confluent is building an incredible business helping companies to build these types of systems on top of Kafka, Samza, and architectural principles originally developed at LinkedIn, but more along the lines of "if you'd like this query to be answered, or this recommender system to be deployed for every user, we can reliably code a data pipeline to do so at LinkedIn scale" than "you can run this query right away against our OLAP warehouse without knowing about distributed systems." (If it's more nuanced than this please correct me!)

On the other hand, Materialize could allow businesses to realize this architecture, with its vast benefits to millisecond-scale data freshness and analytical flexibility, simply by writing SQL queries as if it was a traditional system. As its capabilities expand beyond parity with SQL (though I agree that's absolutely the best place for them to start and optimize), there are tremendous wins here that could power the next generation of real-time systems.

EDIT: some clarifications and additional examples

I also wrote a primer for why the world needs Materialize [1]. It had a big discussion on HN [2], and Materialize's cofounder said it was part of his motivation [3].

[1] https://medium.com/@lironshapira/data-denormalization-is-bro...

[2] https://news.ycombinator.com/item?id=12613586

[3] https://twitter.com/narayanarjun/status/1241450203095465986

Ha! Your blog post was one of the reasons that I trusted in the future of Materialize enough to decide to work here!

I agree, that is exactly the problem that I, in particular, think we are solving.

That's awesome, thanks for fixing denormalization :)

What exactly are "materialized views"?

It's a query of which you save the results in a cache database table, so next time when it is queried, you can provide the results from the cache.

Typically, in a traditional RDBMS, the query is defined as a sql view, which you either have to manually refresh, or can be refreshed periodically.

Using streaming systems like kafka, it's possible to continously update the cached results based in the incoming data, so the result is a near realtime up to date query result.

Writing the stream processing to update the materialized view can be complex, using SQL like materialize enables you to do, makes it a lot more productive.

“In computing, a materialized view is a database object that contains the results of a query. For example, it may be a local copy of data located remotely, or may be a subset of the rows and/or columns of a table or join result, or may be a summary using an aggregate function.”


Let's start with views. A database view is a "stored query" that presents itself as a table, that you can further query against.

If you have a view "bar":

    CREATE VIEW bar AS $$
    SELECT x * 2 AS a, y + 1 AS b FROM foo
and then you `SELECT a FROM bar`, then the "question" you're really asking is just:

    SELECT a FROM (SELECT x * 2 AS a, y + 1 AS b FROM foo)
— which, with efficient query planning, boils down to

    SELECT x * 2 AS a FROM foo
It's especially important to note that the `y + 1` expression from the view definition isn't computed in this query. The inner query from the view isn't "compiled" — forced to be in some shape — but rather sits there in symbolic form, "pasted" into your query, where the query planner can then manipulate and optimize/streamline it further, to suit the needs of the outer query.


To materialize something is to turn it from symbolic-expression form, into "hard" data — a result-set of in-memory row-tuples. Materialization is the "enumeration" in a Streams abstraction, or the "thunk" in a lazy-evaluation language. It's the master screw that forces all the activity dependent on it — that would otherwise stay abstract — to "really happen."

Databases don't materialize anything unless they're forced to. If you do a query like

    SELECT false FROM (SELECT * FROM foo WHERE x = 1)
...no work (especially no IO) actually happens, because no data from the inner query needs to be materialized to resolve the outer query.

Streaming data out of the DB to the user requires serialization [= putting the data in a certain wire format], and serialization requires materialization [= having the data available in memory in order to read and re-format it.] So whatever final shape the data returned from your outermost query has when it "leaves" the DB, that data will always get materialized. But other processes internal to the DB may sometimes require data to be materialized as well.

Materialization is costly — it's usually the only thing forcing the DB to actually read the data on disk, for any columns it wasn't filtering by. Many of the optimizations in RDBMSes — like the elimination of that `y + 1` above — have the goal of avoiding materialization, and the disk-reads / memory allocations / etc. that materialization requires.


Those definitions out of the way, a "materialized view" is something that acts similar to a view (i.e. is constructed in terms of a stored query, and presents itself as a queriable table) but which — unlike a regular view — has been pre-materialized. The query for a matview is still stored, but at some point in advance of querying, the RDBMS actually runs that query, fully materializes the result-set from it, and then caches it.

So, basically, a materialized view is a view with a cached result-set.

Like any cache, this result-set cache increases read-time efficiency in the case where the original computation was costly. (There's no point in "upgrading" a view into a matview if your queries against the plain view were already cheap enough for your needs.)

But like any cache, it needs to be maintained, and can become out-of-sync with its source.

Although materialized views are part of the SQL standard, not all SQL RDBMSes implement them. MySQL/MariaDB does not, for example. (Which is why you'll find that much of the software world just pretends matviews don't exist when designing their DB architectures. If it ever needs to run on MySQL, it can't use matviews.)

The naive approach that some other RDBMSes (e.g. Postgres) take to materialized views, is to only offer manual, full-pass recalculation of the cached result-set, via some explicit command (`REFRESH MATERIALIZED VIEW foo`). This works with "small data"; but at scale, this approach can be so time-consuming for large and complex backing queries, that by the time cache is rebuilt, it's already out-of-date again!

Because there are RDBMSes that either don't have matviews, or don't have scalable matviews, many application developers just avoid the RDBMS's built in matview abstraction, and build their own. Thus, another large swathe of the world's database architecture either will use cron-jobs to regular run+materialize a query, and then dump its results back into a table in the same DB; or it will define on-INSERT/UPDATE/DELETE triggers on "primary" tables, that transform and upsert data into "secondary" denormalized tables. These are both approaches to "simulating" matviews, portably, on an RDBMS substrate that isn't guaranteed to have them.

Other RDBMSes (e.g. Oracle, SQL Server, etc.) do have scalable materialized views, a.k.a. "incrementally materialized" views. These work less like a view with a cache, and more like a secondary table with write-triggers on primary tables to populate it — but all handled under-the-covers by the RDBMS itself. You just define the matview, and the RDBMS sees the data-dependencies and sets up the write-through data flow.

Incrementally-materialized views are great for what they're designed for (reporting, mostly); but they aren't intended to be the bedrock for an entire architecture. Building matviews on top of matviews on top of matviews gets expensive fast, because even fancy enterprise RDBMSes like Oracle don't realize, when populating table X, that writing to X will in turn write to matview Y, which will in turn "fan out" to matviews {A,B,C,D}, etc. These RDBMS's matviews were never intended to support complex "dataflow graphs" of updates like this, and so there's too much overhead (e.g. read-write contention on index locks) to actually make these setups practical. And it's very hard for these DBMSes to change this, as their matviews' caches are fundamentally reliant on database table storage engines, which just aren't the right ADT to hold data with this sort of lifecycle.


Materialize is an "RDBMS" (though it's not, really) engineered from the ground up to make these sorts of dataflow graphs of matviews-on-matviews-on-matviews practical, by doing its caching completely differently.

Materialize looks like a SQL RDBMS from the outside, but Materialize is not a database — not really. (Materialize has no tables. You can't "put" data in it!) Instead, Materialize is a data streaming platform, that caches any intermediate materialized data it's forced to construct during the streaming process, so that other consumers can work off those same intermediate representations, without recomputing the data.

If you've ever worked with Akka's Streams, or Elixir's Flows, or for that matter with Apache Beam (nee Google Dataflow), Materalize is that same kind of pipeline. But where all the plumbing work of creating intermediate representations — normally a procedural map/reduce/partition kind of thing — is done by defining SQL matviews; and where the final output isn't a fixed output of the pipeline, but rather comes from running an arbitrary SQL query against any arbitrary matview defined in the system.

> Most RDBMSes (e.g. Postgres) only offer manual (`REFRESH MATERIALIZED VIEW foo`) full-pass recalculation of the cached result-set for matviews.

"Most" here seems very much wrong, at least of major products: Oracle has an option for on-commit (rather than manual) and incremental/incremental-if-possible (FAST/FORCED) refresh, so it is limited to neither only-manual nor only-full-pass recalculation. SQL Server indexed views (their matview solution) are automatically incrementally updated as base tables change, they don't even have an option for manual full-pass recalculation, AFAICT. DB2 materialized query tables (their matview solution) have an option for immediate (on-commit) refresh (not sure if the algo here is always full-pass, but its at a minimum not always manual.) Firebird and MySQL/MariaDB don't have any support for materialized views at all (though of course you can manually simulate them with additional tables updated by triggers.) Postgres seems to be the only major RDBMS with both material view support and the limitation of only on-demand full-pass recalculation of matviews (for that matter, except maybe DB2 having the full-pass limitation, it seems to be the only one with either the only-manual or only-full-pass limitation.)

I think that it's true that many databases offer incremental updates and it's incorrect to say that manual refreshes were the state of the art.

The important point is that Materialize can do it for almost any query, very efficiently, compared to existing options. That opens a lot of possibilities.

> The important point is that Materialize can do it for almost any query, very efficiently, compared to existing options. That opens a lot of possibilities.

Yes, this does seem like a very big deal.

You're right; I updated my comment.

That was a fantastic and illuminating update, thank you.

This is a fantastic comment!

One small thing: we do now have tables[1]! At the moment they are ephemeral and only support inserts -- no update/delete. We will remove both of those limitations over time, though!

[1]: https://materialize.com/docs/sql/create-table/

This is an outstanding explanation. Much better than mine.

updated results of a query - eg if you do some aggregation or filtering on a table, or join two tables, or anything of the sort - materialized view will give you the updated results of the query in a separate table

Suppose you have normalized your data schema, up to at least 3NF, perhaps even further up to 4NF, 5NF or (as Codd intended) BCNF.

Great! You are now largely liberated from introducing many kinds of anomaly at insertion time. And you'll often only need to write once for each datum (modulo implementation details like write amplification), because a normalised schema has "a place for everything and everything in its place".

Now comes time to query the data. You write some joins, and all is well. But a few things start to happen. One is that writing joins over and over becomes laborious. What you'd really like is some denormalised intermediary views, which transform the fully-normalised base schema into something that's more convenient to query. You can also use this to create an isolation layer between the base schema and any consumers, which will make future schema changes easier and possibly improve security.

The logical endpoint of doing so is the Data Warehouse (particularly in the Kimball/star schema/dimensional modelling style). You project your normalised data, which you have high confidence in, into a completely different shape that is optimised for fast summarisation and exploration. You use this as a read-only database, because it massively duplicates a lot of information that could otherwise have been derived via query (for example, instead of a single "date" field, you have fields for day of week, day of month, day of year, week of year, whether it's a holiday ... I've built tables which include columns like "days until major conference X" and "days since last quarterly release").

Now we reach the first problem. It's too slow! Projecting that data from the normalised schema requires a lot of storage and compute. You realise after some scratching that your goal all along was to pay that cost upfront so that you can reap the benefits at query time. What you want is a view that has the physical characteristics of a table. Meaning you want to write out the results of the query, but still treat it like a view. You've "materialized" the view.

Now the second problem. Who, or what, does that projection? Right now that role is filled by ETL, "Extract, Transform and Load". Extract from the normalised system, transform it into the denormalised version, then load that into a data warehouse. Most places do this on a regular cadence, such as nightly, because it just takes buckets and buckets of work to regenerate the output every time.

Now enters Materialize, who have a secret weapon: timely dataflow. The basic outcome is that instead of re-running an entire view query to regenerate the materialized view, they can, from a given datum, determine exactly what will change in the materialized view and only update that. That makes such views potentially thousands of times cheaper. You could even run the normalised schema and the denormalised projections on the same physical set of data -- no need for the overhead and complexity of ETL, no need to run two database systems, no need to wait (without the added complexity of a full streaming platform).

That's a great description! Does materialize describe how they implement timely dataflow?

At my current company, we have built some systems like this. Where a downstream table is essentially a function of a dozen upstream tables.

Whenever one of the upstream tables changes, it's primary key is published to a queue, some worker translates this upstream primary key into a set of downstream primary keys, and publishes these downstream primary keys to a compacted queue.

The compacted queue is read by another worker, that "recomputes" each dirty key, one-at-a-time, which involves fetching the latest-and-greatest version of each upstream table.

This last worker is the bottleneck, but it's optimized by per-key caching, so we only fetch the latest-and-greatest version once per update. It can also be safely and arbitrarily parallelized, since the stream they read from is partitioned on key.

> Does materialize describe how they implement timely dataflow?

It's open source (https://github.com/TimelyDataflow/timely-dataflow), and also extensively written about both in academic research papers and documentation for the project itself. The GitHub repo has pointers to all of that. See also differential dataflow (https://github.com/timelydataflow/differential-dataflow).

Here's a 15-minute introduction to Timely Dataflow by Frank, our co-founder: https://www.youtube.com/watch?v=yOnPmVf4YWo

How is this different from Apache flink real time SQL support?


I think the main distinction is around "interactivity" and how long it takes from typing a query to getting results out. Once you stand up a Flink dataflow, it should move along a brisk clip. But standing up a new dataflow is relatively heavy-weight for them; typically you have to re-flow all of the data that feeds the query.

Materialize has a different architecture that allows more state-sharing between operators, and allows many queries to spin up in milliseconds. Primarily, this is when your query depends on existing relational data in pre-indexed form (e.g. joining by primary and foreign keys).

You can read a bit more in an overview blog post [0] and in more detail in a VLDB paper [1]. I'm sure there are a number of other quirks distinguishing Flink and Materialize, probably several in their favor, but this is the high-order bit for me.

[0]: https://materialize.com/materialize-under-the-hood/ [1]: http://www.vldb.org/pvldb/vol13/p1793-mcsherry.pdf

How were previous implementations of materialized views deficient?

Here's a nice writeup of Materialize:


Not really mentioned here, but in standard postgres it might be quite expensive to update the view so you can only do it periodically. Materialize keeps that up-to-date continuously.

Joins were unavailable or subject to extreme limitations. Or just plain wrong!

RDBMSes enable you to create materialized views only for data in the database.

Materialize enables you to do this for any streaming data source in your organization, with the ease of writing SQL.

This enables you to simply write a SQL statement joining data from Salesforce + SAP + Siebel as soon as the data changes, and store the results as a near real-time up to date database table.

It does depend on a lot of underlying plumbing: streaming platform (e.g. kafka), and streaming data sources (e.g., kafka connect + debezium).

Isn't this pretty similar to what Dremio does?

Dremio is a batch processor, not a stream processor. The fundamental difference is that a batch processor will need to recompute a query from scratch whenever the input data changes, while a stream processor can incrementally update the existing query result based on the change to the input.

This can make a huge difference when making small changes to large datasets. Materialize can incrementally compute small changes to very complicated queries in just a few milliseconds, while with batch processors you're looking at latency in the hundreds of milliseconds, seconds, or minutes, depending on the size of the data.

Another way of looking at it is that in batch processors, latency scales with the size of the total data, while in stream processors, latency scales with the size of the updates to the data.

I see, thank you for the explanation!

Materialize can help us manifest The Web After Tomorrow [^1].

My previous comments persuading you why DDF is so crucial to the future of the Web:

> "There is a big upset coming in the UX world as we converge toward a generalized implementation of the "diff & patch" pattern which underlies Git, React, compiler optimization, scene rendering, and query optimization." — https://news.ycombinator.com/item?id=21683385 also with links to prior art like Adapton and Incremental.

> "DD (Differential Dataflow) is commercialized in Materialize" — https://news.ycombinator.com/item?id=24846119

> "Materialize exists to efficiently solve the view maintenance problem" https://news.ycombinator.com/item?id=22888396

    [^1]: https://tonsky.me/blog/the-web-after-tomorrow/

Thanks for this, I'm glad to see I'm not the only one tired of writing everything twice (once in the frontend and once in the backend). I'll revisit the links later.

I'm glad more people are tackling this problem. There still isn't a good solution to real-time aggregation data at large scale.

At a previous company, we dealt with huge data streams (~1TB data / minute) and our customers expected real-time aggregations.

Making an in-house solution for this was incredibly difficult because each customer's data differed wildly. For example:

- Customer A's shards might have so much cardinality where memory becomes an issue.

- Customer B's shards might have so much throughput where CPU becomes a constraint. Sometimes a single aggregation may have so much throughput where you need to artificially increase the cardinality and aggregate the aggregations!

This makes the optimal sharding strategy very complex. Ideally, you want to bin-pack memory-constrained aggregations with CPU-constrained aggregations. In my opinion, the ideal approach involves detecting the cardinality of each shard and bin-packing them.

I've always found that when you are solving a concrete problem, like you were, it's vastly easier than the case of a general-purpose database because you can make all the tradeoffs that benefit your exact use case. but it sounds like that's not what you experienced. was it just how heterogeneous the clients' needs were? I guess what I'm saying is, if you are capable of handling 1TB/minute, seems like you're plenty able to and would want to be designing the system yourself - but interested what I'm missing about this.

Late to the post, but if anyone wants a good primer on Materialize (beyond what their actual engineers and a cofounder are saying in the comments), check out the Materialize Quarantine Database Lecture: https://db.cs.cmu.edu/events/db-seminar-spring-2020-db-group...

The actual talk seems to be here: https://www.youtube.com/watch?v=9XTg09W5USM

Thanks! Accidentally copied the wrong link in haste.

The headline refers to "incrementally updated materialize views". How does a company get funding for a feature that has already existed in other DBs for at least a decade?

E.g, Vertica refers to this as Live Aggregate Projections.

It's a cool concept but comes with huge caveats. Keeping tracking of non-estimated cardinality for COUNT DISTINCT type queries, as an example.

(Disclaimer: I'm one of the engineers at Materialize.)

> How does a company get funding for a feature that has already existed in other DBs for at least a decade? ... It's a cool concept but comes with huge caveats.

I think you answered your own question here. Incrementally-maintained views in existing database systems typically come with huge caveats. In Materialize, they largely don't.

Most other systems place severe restrictions on the kind of queries that can be incrementally maintained, limiting the queries to certain functions only, or aggregations only, or only queries without joins—or if they do support maintaining joins, often the joins must occur only on the involved tables' keys. In Materialize, by contrast, there are approximately no such restrictions. Want to incrementally-maintain a five-way join where some of the join keys are expressions, not key columns? No problem.

That's not to say there aren't some caveats. We don't yet have a good story for incrementally-maintaining queries that observe the current wall-clock time [0]. And our query optimizer is still young (optimization of streaming queries is a rather open research problem), so for some more complicated queries you may not get the resource utilization you want out of the box.

But, for many queries of impressive complexity, Materialize can incrementally-maintain results far faster than competing products—if those products can incrementally maintain those queries at all.

The technology that makes Materialize special, in our opinion, is a novel incremental-compute framework called differential dataflow. There was an extensive HN discussion on the subject a while back that you might be interested in [1].

[0]: https://github.com/MaterializeInc/materialize/issues/2439

[1]: https://news.ycombinator.com/item?id=22359769

This is one of my favorite types of HN comments: admits the bias upfront, offers a meaningful technical answer, and links to relevant documents for a deeper dive. Thank you so much!

Thanks for the explanation. I'm going to look more into this as I'm working on a new service on top of Vertica. There is a lot I don't like about Vertica and don't see alternatives such as Snowflake to be much of an improvement.

Hi - I'm enjoying reading the discussion around this, and the previous discussion [1] as well. It's possible that Materialize can help us transition a really complex pipeline to real-time.

To the short discussion here [0] about window functions - any update to that in the last 9 months?

Our workloads involve, in a lot of cases, ingesting records, and keeping track of whether N records of a similar type have been seen within any 15 minute interval. The records do not arrive in chronological order. Is this currently a potential use case for Materialize?

[0] https://news.ycombinator.com/item?id=22362106

What about the other big problem ignored here: does your streaming platform separate compute and storage?

Because GCP DataFlow does. Flink doesn't. DataFlow allows you to elastically scale the compute you need (Snowflake, Databricks). If you can't do that, materialized views will be a more niche feature for bigger 24x7 deployments with predictable workflows.

As George points out above, we haven’t added our native persistence layer yet. Consistency guarantees are something we care a lot, so for many scenarios, we leverage the upstream datastore (often Kafka).

But to answer your question, yes, our intention is to support separate cloud-native storage layers.

My dim and distant recollection is that Beam and/or GCP Data Flow require someone to implement PCollections and PTransforms to get the benefit of that magic. That's not a trivial exercise, compared to writing SQL.

Hi, I work at Materialize.

You can read about Vertica's "Live Aggregate Projections" here:


In particular, there are important constraints like (among others)

> The projections can reference only one table.

In Materialize you can spin up just about any SQL92 query, join eight relations together, have correlated subqueries, count distinct if you want. It is then all maintained incrementally.

The lack of caveats is the main difference from the existing systems.

Materialize is the real deal - completely different architecture under the hood. Origin project is Timely Dataflow & Naiad.


> The headline refers to "incrementally updated materialize views". How does a company get funding for a feature that has already existed in other DBs for at least a decade?

They're getting funding for doing it much more efficiently.

I read into the background papers when it first popped up. This is legitimate, deep computer science that other DBs don't yet have.

> All of this comes in a single binary that is easy to install, easy to use, and easy to deploy.

And it looks like they chose a sensible license for that binary [1], so they're not giving too much away.

I wonder though if they could have made this work as a bootstrapped business, so they would answer only to customers, not to investors chasing growth at all costs.

[1]: https://materialize.com/download/

Bootstrapping is fun until you can't make payroll.

If your goal is an exit, and you can raise this much, why not.

I'm so psyched about Materialize.

An old coworker explained to me about how his previous company used DBT to create many different projections of messy data to serve many applications, rather than trying to come up with the One Canonical Representation. It truly blew my mind in terms of thinking about how to model data within a business.

The huge limitation with this vision is that it only works in places where you can tolerate some pretty significant staleness. So the promise of this approach excludes most OLTP applications. I simply assumed it wouldn't be reasonable to create something that allows for unconstrained SQL-based transformations in real time, and that no one was working on this. Oh well.

But several months back, I discovered Materialize and it was an "oh shit" moment. Someone was actually doing this, and in a very first principles-driven approach. I'm really excited for how this project evolves.

dbt is fantastic. It depends on your usecase, but for ours having an hourly sync of data is fine for reporting.

We're super excited about materialize at Fishtown Analytics (the company that makes dbt). I think that the "reporting" use-case is well served by Snowflake/BigQuery/etc, but I do think that operational use-cases are left behind by batch-based transformation models.

The thing I'm most excited about in materialize is the ability to create Sinks (https://materialize.com/docs/sql/create-sink/). The combination of 1) streaming transformations over realtime source data and 2) streaming outputs into external systems feels like the way of the future IMO.

I saw a dbt-materialize plugin (https://github.com/jwills/dbt-materialize) out there in the wild. My guess is it's not ready for primetime yet, but would love to bake support for materialize into dbt when the time is right :)

I wonder if BSL becomes the new standard for open source commercial products. It's a good trade-off between freedom and real world business pressure.

Doubt it. Lots of aversion to the license given its limited use and some ambiguous terms/education around the various windows.

I think it can be, I know a few other potentially successful examples like CockroachDB and ZeroTier. The BSL license makes the entire project basically FOSS for you and me, but not for the big sharks. Which I guess is much better for the world compared to open-core and of course proprietary SaaS.

Don’t forget to keep your eyes on the architectural concept of Command Query Record Separation (CQRS).

When combined with event sourcing [1], there is a new unified architecture possible that solves the problem that microservices create by fragmenting data [2], and performant querying on data updating in real time.

This architecture represents more complexity but increased flexibility.

I recently saw this article about federated GraphQL [3], and while a cool idea and probably the ultimate solution (API composition), I expect that with network and physical boundaries between services still adding latency, we need materialized views as part of the architecture to compensate for the overhead of bringing together aggregate root objects from multiple systems.

[1] https://www.confluent.io/blog/event-sourcing-cqrs-stream-pro...

[2] https://microservices.io/patterns/data/cqrs.html

[3] https://netflixtechblog.com/how-netflix-scales-its-api-with-...

Can you point me at documentation for the fault tolerance of the system? A huge issue for streaming systems (and largely unsolved AFAIK) is being able to guarantee that counts aren't duplicated when things fail. How does Materialize handle the relevant failure scenarios in order to prevent inaccurate counts/sums/etc?

Hi! I work at Materialize.

I think the right starter take is that Materialize is a deterministic compute engine, one that relies on other infrastructure to act as the source of truth for your data. It can pull data out of your RDBMS's binlog, out of Debezium events you've put in to Kafka, out of local files, etc.

On failure and restart, Materialize leans on the ability to return to the assumed source of truth, again a RDBMS + CDC or perhaps Kafka. I don't recommend thinking about Materialize as a place to sink your streaming events at the moment (there is movement in that direction, because the operational overhead of things like Kafka is real).

The main difference is that unlike an OLTP system, Materialize doesn't have to make and persist non-deterministic choices about e.g. which transactions commit and which do not. That makes fault-tolerance a performance feature rather than a correctness feature, at which point there are a few other options as well (e.g. active-active).

Hope this helps!

This is a solved problem, for a few years now. The basic trick is to publish "pending" messages to the broker which are ACK'd by a later written message, only after the transaction and all it's effects have been committed to stable storage (somewhere). Meanwhile, you also capture consumption state (e.x. offsets) into the same database and transaction within which you're updating the materialization results of a streaming computation.

Here's [1] a nice blog post from the Kafka folks on how they approached it.

Gazette [2] (I'm the primary architect) also solves in with some different trade-offs: a "thicker" client, but with no head-of-line blocking and reduced end-to-end latency.

Estuary Flow [3], built on Gazette, leverages this to provide exactly-once, incremental map/reduce and materializations into arbitrary databases.

[1]: https://www.confluent.io/blog/exactly-once-semantics-are-pos...

[2]: https://gazette.readthedocs.io/en/latest/architecture-exactl...

[3]: https://estuary.readthedocs.io/en/latest/README.html

Interesting! I'm going to read into the info you linked. Thanks for the info!

This is interesting given what AWS just announced (AWS Glue Elastic Views):


This is maybe a silly question, but what's the difference between timely dataflow and Spark's execution engine? From my understanding they're doing very similar things - break down a sequence of functions on a stream of data, parallelize them on several machines, and then gather the results.

I understand that the feature set of timely dataflow is more flexible than Spark - I just don't understand why (I couldn't figure it out from the paper, academic papers really go over my head).

congrats to Frank McSherry and the rest of the materialized team! very impressed by your project.

Does anyone know how Materialize stacks up against VIATRA in terms of performance? VIATRA seems very similar to Materialize. They have multiple algorithms implemented to incrementalize queries, including Differential Dataflow. The main difference seems to be that it's based on Graph Patterns instead of SQL.

It's a good question, but you'd have to ask them I think. Tamas (from Itemis) and I were in touch for a while, mostly shaking out why DD was out-performing their previous approach, but I haven't heard from him since.

My context at the time was that they were focused on doing single rounds of incremental updates, as in a PL UX, whereas DD aims at high throughput changes across multiple concurrent timestamps. That's old information though, so it could be very different now!

Thanks for the reply!

A while ago (2018), the people behind VIATRA performed a cross-technology benchmark where they compared their performance to 9 other incremental and non-incremental solutions (Neo4j, Drools, OCL, SQLite, MySQL, among others) [1]. Perhaps it could be interesting to rerun that benchmark while including Materialize?

This would give us a direct comparison between Materialize and other existing solutions. Their benchmark is however based on a kind of UX case, so the tests might be a bit biased towards that use case.

[1] The Train Benchmark: cross-technology performance evaluation of continuous model queries

is there a single comprehensive list of restrictions on what can and can't be materialized? for example, if SQL Server can't efficiently maintain your materialized view then it doesn't let you create it - the whole list of restrictions is here: https://docs.microsoft.com/en-us/sql/relational-databases/vi...

I'd love to be able to directly compare this with that Materialize is capable of - does a similar document exist?

It's easier to describe the things that cannot be materialized.

The only rule at the moment is that you cannot currently maintain queries that use the functions `current_time()`, `now()`, and `mz_logical_timestamp()`. These are quantities that change automatically without data changing, and shaking out what maintaining them should mean is still open.

Other than that, any SELECT query you can write can be materialized and incrementally maintained.


there are messages like this in the docs:

> "WARNING! LATERAL subqueries can be very expensive to compute. For best results, do not materialize a view containing a LATERAL subquery without first inspecting the plan via the EXPLAIN statement. In many common patterns involving LATERAL joins, Materialize can optimize away the join entirely. "

I take this to mean that Materialize cannot always efficiently maintain a view with lateral joins - that's fine neither can SQL Server, but it would be nice if I could find all these exceptions in one place like I can for SQL Server.

..fwiw I prefer the behavior of failing early rather than letting potential severe performance problems into prod.

[1] https://materialize.com/docs/sql/join/#lateral-subqueries

> I take this to mean that Materialize cannot efficiently maintain a view with lateral joins [...]

Well, no this isn't a correct take. Lateral joins introduce what is essentially a correlated subquery, and that can be surprisingly expensive, or it can be fine. If you aren't sure that it will be fine, check out the plan with the EXPLAIN statement.

Here's some more to read about lateral joins in Materialize:


sorry you missed my ninja-edit - it sounds like SOME lateral join queries CAN be efficiently maintained but not ALL (not the ones that are surprisingly expensive for whatever reason) that's where the promise of "we can materialize any query!" starts to fall apart for me. presumably the surprisingly expensive cases are the ones where some rewrite rules can't guarantee correctness without hiding indexes or predicate pushdowns or whatever - the doc says review the explain plan first but what precisely about the explain plan would tell me that the materialized view won't be efficiently maintained? ideally these cases can be known ahead of time so I can come up with a conformant query rather than trying variations to see what works.

..and more to the point, there are obviously limits to what can be efficiently maintained. I would love to see that list as this is what would give me a good idea of how Materialize compares to my daily driver RDBMS which happens to be SQL Server and whose limits I'm unfortunately intimately familiar.

I don't think there is anything fundamentally different from an existing database. In all relational databases, some lateral joins can be expensive to compute. In Materialize, those same lateral joins will also be expensive to maintain.

I'd be surprised to hear you beat up postgres or SQL Server because they claim they can evaluate any SQL query, but it turns out that some SQL queries can be expensive. That's all we're talking about here.

I am genuinely interested in Materialize's capability to incrementally maintain views and I understand there are all sorts of limitations as to when that's even possible - I can't find a comprehensive list of them. I don't think it's fair to say you support every possible select statement and then just have some of them be slow. The lateral join case was the first warning I encountered in the docs - is that the ONLY case and every other possible select statement can be incrementally maintained?

All queries are incrementally maintained with the property that we do work proportional to the number of records in difference at each intermediate stage of the query plan. That includes those with lateral joins; they are not an exception.

I'm not clear on your "all sorts of limitations"; you'll have to fill me in on them?

> I'm not clear on your "all sorts of limitations"; you'll have to fill me in on them?

this feels like bait but honestly I'm under the impression that incrementally updating materialized views (where optimal = the proportion of changed records) just isn't always possible. for example, max and min aggregates aren't supported in SQL Server because updating the current max or min record requires a query to find the new max or min record - that's not considered an incremental update and so it's not supported and trying to materialize the view fails. there are a number of cases like this and a big part of problem solving with SQL Server is figuring out how to structure a view within these constraints. if you can then you can rest assured that updates will be incremental and performant - this is important because performance is the feature, if the update is slow then my app is broken. if Materialize has a list of constraints shorter than SQL Server's then you're sitting on technology worth billions - it's hard for me to believe that your list of constraints is "there are none" especially when there are explicit-but-vague performance warnings in the docs.

(Disclaimer: I'm one of the engineers at Materialize)

> for example, max and min aggregates aren't supported in SQL Server because updating the current max or min record requires a query to find the new max or min record

This isn't a requirement in Materialize, because Materialize will store values in a reduction tree (which is basically like a min / max heap) so that when we add or remove a record, we can compute a new min / max in O(log (total_number_of_records)) time in the worst case (when a record is the new min / max). Realistically, that log term is bounded to 16 (it's a 16-ary heap and we don't support more than 2^64 records). Computing the min / max this way is substantially better than having to recompute with a linear scan. This [1] provides a lot more details on how we compute reductions in Materialize.

> there are obviously limits to what can be efficiently maintained

I think we fundamentally disagree here. In our view, we should be able to maintain every view either in linear time wrt the number of updates or sublinear time with respect to the overall dataset, and every case that doesn't do so is a bug. The underlying computational frameworks [2] we're using are designed for that, so this isn't just like a random fantasy.

> if Materialize has a list of constraints shorter than SQL Server's then you're sitting on technology worth billions

Thank you! I certainly hope so!

[1]: https://materialize.com/robust-reductions-in-materialize/ [2]: https://github.com/timelydataflow/differential-dataflow/blob...

> In our view, we should be able to maintain every view either in linear time wrt the number of updates or sublinear time with respect to the overall dataset, and every case that doesn't do so is a bug.

This is awesome and I believe that should be technically possible for any query given the right data structure. The reduction tree works for min/max but is it a general solution or are there other data structures for other purposes - n per x and top subqueries come to mind. Is it all handled already or are there some limitations and a roadmap?

I'm not entirely sure what you mean by n per x, but if by top you mean something like "get top k records by group" then we support that. See [1] for more details. top-k is actually also rendered with a heap-like dataflow

When we plan queries we are rendering them into dataflow graphs that consist of one or more dataflow operators transforming data and sending it to other operators. Every single operator is designed to do work proportional to the number of changes in its inputs / outputs. For us, optimizing our performance a little bit less a matter of the right data structures, and more about expressing things in a dataflow that can handle changes to inputs robustly. But the robustness is more a question of "what do are my constant factors when updating results" and not "is this being incrementally maintained or not".

We have a known limitations page in our docs here [2] but it mostly covers things like incompleteness in our SQL support or Postgres compatibility. We published our roadmap in a blog post a few months ago here [3]. Beyond that everything is public on Github [4].

[1]: https://materialize.com/docs/sql/idioms/ [2]: https://materialize.com/docs/known-limitations/ [3]: https://materialize.com/blog-roadmap/ [4]: https://github.com/MaterializeInc/materialize

Min and max work using a hierarchical reduction tree, the dataflow equivalent of a priority queue. They will update, under arbitrary changes to the input relation, in time proportional to the number of those changes.

> [...] it's hard for me to believe that your list of constraints is "there are none" especially when there are explicit-but-vague performance warnings in the docs.

I think we're done here. There's plenty to read if you are sincerely interested, it's all public to both try and read, but you'll need to find someone new to ask, ideally with a less adversarial communication style.

Sorry I’m definitely overly pessimistic when it comes to new database tech - you’ll find us industry hardened rdbms users hard to convince (we’ve been through a lot) thanks for chatting!

This is a big win for Rust.


I was wondering if Materialize is meant to be used in analytical workloads only, or would it be equally up to the task for consumer app kind of workloads as well?

Here's my take on this, from a few months back:


Great to hear they got more funding!

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact