
Materialize: A Streaming Data Warehouse - irfansharif
https://materialize.io/blog-introduction/
======
luhn
I didn't really understand what the product actually did after reading this
blog post or the products page. I found the docs much more edifying:

> Materialize lets you ask questions about your data, and then get the answers
> in real time.

> Why not just use your database’s built-in functionality to perform these
> same computations? Because your database often acts as if it’s never been
> asked that question before, which means it can take a long time to come up
> with an answer, each and every time you pose the query.

> Materialize instead keeps the results of the queries and incrementally
> updates them as new data comes in. So, rather than recalculating the answer
> each time it’s asked, Materialize continually updates the answer and gives
> you the answer’s current state from memory.

> Importantly, Materialize supports incrementally updating a much broader set
> of views than is common in traditional databases (e.g. views over multi-way
> joins with complex aggregations), and can do incremental updates in the
> presence of arbitrary inserts, updates, and deletes in the input streams.

[https://materialize.io/docs/](https://materialize.io/docs/)
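
To make that concrete, a minimal sketch of the workflow the docs describe (table and column names made up):

    -- Maintain a running count of orders per customer. Materialize
    -- updates this incrementally as new orders arrive, rather than
    -- rescanning the input on every query.
    CREATE MATERIALIZED VIEW orders_per_customer AS
        SELECT customer_id, count(*) AS num_orders
        FROM orders
        GROUP BY customer_id;

    -- Reads return the current state of the view from memory.
    SELECT num_orders FROM orders_per_customer WHERE customer_id = 42;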

~~~
yomly
This reminds me a lot of Noria DB. Wonder if anyone familiar with both can
shed any further light?

~~~
benesch
Indeed, Materialize is quite similar to Noria, and has the Frank McSherry
stamp of awesomeness. [0] We know many of the Noria folks and have a lot of
respect for them and their work. I also worked on the Noria project for a
summer in college, and am a full-time engineer at Materialize now.

The biggest difference is one of intended use. Noria is, first and foremost, a
research prototype, intended to explore new ideas in systems research.
Materialize, by contrast, is intended to be a rock-solid piece of production
infrastructure. (Much of the interesting research, in timely and differential
dataflow, is already done.) We've invested a good bit in supporting the
thornier bits of SQL, like full joins, nested subqueries, correlated
subqueries, variable-precision decimals, and so on. Noria's support for SQL is
less extensive; I think the decisions have been guided mostly by what's
necessary to run its lobste.rs and HotCRP benchmarks. Make no mistake: Noria
is an impressive piece of engineering, but if you're an enterprise looking to
deploy Noria, there's no one you can pay for support or to implement feature
requests.

One area where Noria shines is in partial materialization. Details are in the
Noria paper [1], but the tl;dr is that Noria has a lot of smarts around
automatically materializing only the subset of the view that is actually
accessed, while presently Materialize requires that you explicitly declare
what subsets to materialize. We have some plans for how to bring these smarts
to Materialize, but we haven't implemented them yet.

Also worth noting is that Materialize's underlying dataflow engine,
differential dataflow, has the ability to support iterative computation, while
Noria's engine requires an acyclic dataflow graph. We don't yet expose this
power in Materialize, but will soon. Put another way: `WITH RECURSIVE` queries
are a real and near possibility in Materialize, while (as I understand it)
`WITH RECURSIVE` queries would require substantial retooling of Noria's
underlying dataflow engine.

One of the creators of Noria, Jon Gjengset, did an interview on Noria [2] that
covered some of the differences between Noria and differential dataflow from
his perspective, which I highly recommend you check out as well!

[0]:
[https://twitter.com/frankmcsherry/status/1056957760435376129?lang=en](https://twitter.com/frankmcsherry/status/1056957760435376129?lang=en)

[1]:
[https://jon.tsp.io/papers/osdi18-noria.pdf](https://jon.tsp.io/papers/osdi18-noria.pdf)

[2]:
[https://notamonadtutorial.com/interview-with-norias-creator-a-promising-dataflow-database-implemented-in-rust-352e2c3d9d95](https://notamonadtutorial.com/interview-with-norias-creator-a-promising-dataflow-database-implemented-in-rust-352e2c3d9d95)

~~~
choppaface
Do you have an example of where one might use WITH RECURSIVE?

~~~
gbrgr
A classical example is graph reachability. You'd express reachability via WITH
RECURSIVE, stating that nodes are reachable if there is a path of arbitrary
length between them. (Recursion is needed since a plain SQL query can only
match paths up to a fixed length.)
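
A sketch of such a query, assuming a hypothetical edges(src, dst) table:

    -- All nodes reachable from node 1 by following edges
    -- transitively. Plain UNION (not UNION ALL) deduplicates,
    -- so the recursion terminates even on cyclic graphs.
    WITH RECURSIVE reachable(node) AS (
        SELECT 1          -- base case: the starting node
        UNION
        SELECT e.dst      -- recursive case: follow one more edge
        FROM reachable r
        JOIN edges e ON e.src = r.node
    )
    SELECT node FROM reachable;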

------
slap_shot
> We believe that streaming architectures are the only ones that can produce
> this ideal data infrastructure.

I just want to say this is a very dangerous assumption to make.

I run a company that helps our customers consolidate and transform data from
virtually anywhere into their data warehouses. When we first started, the
engineer in me made the same declaration, and I worked to get data into
warehouses seconds after an event or record was generated in an origin system
(website, app, database, Salesforce, etc.).

What I quickly learned was that analysts and data scientists simply didn't
want or need this. Refreshing the data every five minutes in batches was more
than sufficient.

Secondly, almost all data is useless in its raw form. The analysts had to
perform ELT jobs on their data in the warehouse to clean, dedupe, aggregate,
and project their business rules on that data. These functions often require
the database to scan over historical data to produce the new materializations
of that data. So even if we could get the data in the warehouse in sub-minute
latency, the jobs to transform that data ran every 5 minutes.

To be clear, I don't discount the need for telemetry and _some_ data to be
actionable in a smaller time frame; I'm just wary of a data warehouse
fulfilling that obligation.

In any event, I do think this direction is the future (an overwhelming number
of data sources allow change data capture almost immediately after an event
occurs); I just don't think it's the only architecture that can satisfy most
analysts'/data scientists' needs today.

I would love to hear the use cases that your customers have that made
Materialize a good fit!

~~~
benesch
> I just want to say this is a very dangerous assumption to make.

I think we're actually arguing the same points here. It's not that _every_ use
case needs single-digit millisecond latencies! There are plenty of use cases
that are satisfied by batch jobs running every hour or every night.

But when you _do_ need real-time processing, the current infrastructure is
insufficient. When you _do_ need single-digit latency, running your batch jobs
every second, or every millisecond, is computationally infeasible. What you
need is a reactive, streaming infrastructure that's as powerful as your
existing batch infrastructure. Existing streaming infrastructure requires you
to make tradeoffs on consistency, computational expressiveness, or both; we're
rapidly evolving Materialize so that you don't need to compromise on either
point.

And once you have a streaming data warehouse in place for the use cases that
really demand the single-digit latencies, you might as well plug your analysts
and data scientists into that same warehouse, so you're not maintaining two
separate data warehouses. That's what we mean by ideal: not only does it work
for the systems with real-time requirements, but it works just as well for the
humans with looser requirements.

To give you an example, let me respond to this point directly:

> Secondly, almost all data is useless in its raw form. The analysts had to
> perform ELT jobs on their data in the warehouse to clean, dedupe, aggregate,
> and project their business rules on that data. These functions often require
> the database to scan over historical data to produce the new
> materializations of that data. So even if we could get the data in the
> warehouse in sub-minute latency, the jobs to transform that data ran every 5
> minutes.

The idea is that you would have your analysts write these ETL pipelines
directly in Materialize. If you can express the
cleaning/de-duplication/aggregation/projection in SQL, Materialize can
incrementally maintain it for you. I'm familiar with a fair few ETL pipelines
that are just SQL, though there are some transformations that are awkward to
express in SQL. Down the road we might expose something closer to the raw
differential dataflow API [0] for power users.

[0]:
[https://github.com/TimelyDataflow/differential-dataflow](https://github.com/TimelyDataflow/differential-dataflow)
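
For illustration, a pipeline like this (hypothetical tables; standard SQL
only) is the sort of thing that would be incrementally maintained:

    -- Dedupe exact duplicates (e.g. from at-least-once delivery)
    -- and drop malformed rows...
    CREATE MATERIALIZED VIEW clean_events AS
        SELECT DISTINCT event_id, user_id, amount
        FROM raw_events
        WHERE amount IS NOT NULL;

    -- ...then aggregate; both views stay up to date as new
    -- raw events stream in.
    CREATE MATERIALIZED VIEW revenue_by_user AS
        SELECT user_id, sum(amount) AS revenue
        FROM clean_events
        GROUP BY user_id;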

~~~
vhold
I think what might be really unique here, and what people aren't yet
imagining, are the new applications made possible by having <100ms updates on
complex materialized views.

With sufficiently expressive SQL and UDF support, there are whole classes of
stateful services performing lookups, aggregations, etc., that could be
written as just views on streams of data. Experts who model systems in SQL but
aren't experts in writing distributed stateful streaming services would
basically be able to start deploying services.

Are there any plans to support partitioned window functions, particularly
lag(), lead(), first(), last() OVER ()? That would be remarkably powerful.
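
For reference, this is the kind of query meant, against a hypothetical
readings(sensor_id, ts, value) table:

    -- Compare each reading to the previous one from the same
    -- sensor, using a partitioned window function.
    SELECT sensor_id, ts, value,
           value - lag(value) OVER (
               PARTITION BY sensor_id ORDER BY ts
           ) AS delta
    FROM readings;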

~~~
benesch
I agree wholeheartedly with your take!

Window functions are a particular favorite of mine, but we haven’t seen much
customer demand for them yet, so they haven’t been officially scheduled on the
roadmap. They require some finesse to support in a streaming system, as you
have to reconstruct the potentially large window whenever you receive new
data. Probably some interesting research to be done here, or at least some
interesting blog posts from Frank.

Please feel free to file issues about any of these functions that you’d like
to see support for! We especially love seeing sample queries from real
pipelines.

~~~
jacques_chester
I have a strong suspicion that bitemporalism makes a lot of these problems
less problematic. The actual volumes of data are the same, but the
all-or-nothingness of windowing over very large data sets in order to avoid
missing anything that arrived late goes away.

I wrote shambolic stream-of-consciousness notes on it several years ago:
[https://docs.google.com/document/d/1ZlPp099_fV1lyYWACSyuWY_jfUUM4kTkhk_0AVejN1k/edit](https://docs.google.com/document/d/1ZlPp099_fV1lyYWACSyuWY_jfUUM4kTkhk_0AVejN1k/edit)

The gist being that the mechanisms of windowing, triggering and retraction a
la Beam are actually workarounds for a lack of bitemporalism.

~~~
benesch
We’re thinking along very similar lines! We’ve got some of our thoughts around
evolving how timestamps work in Materialize written down here:
[https://github.com/MaterializeInc/materialize/issues/1309](https://github.com/MaterializeInc/materialize/issues/1309)

~~~
jacques_chester
Sort of -- the problem I see in the event time / processing time distinction
is that it's about _instants_ rather than _intervals_. There are a number of
models and queries that are not reliably expressible with instants alone,
unless you reinvent intervals with them.

For example, if I rely on "updated-at" and infer that whatever record has the
latest updated-at is the "current" record, then I may create the illusion that
there are no gaps in my facts. That may not be so.

A reference system to look at is Crux:
[https://opencrux.com/](https://opencrux.com/)

~~~
benesch
> For example, if I rely on "updated-at" and infer that whatever record has
> the latest updated-at is the "current" record, then I may create the
> illusion that there are no gaps in my facts. That may not be so.

I believe that notion is captured by timely's capabilities [0]. Your
capability has a current time, and you can only produce records at or greater
than that time. So you could produce a record at, say, t + 3, then t + 5, and
then produce a record at t + 1. But not until you downgrade your capability to
t + 6 will the record at t + 5 be considered final; downgrading your
capability is how you indicate that you have produced the correct and final
set of facts for all times less than your capability's new time.

If your events can arrive out of order forever, then you have a problem, as
you'll never be able to downgrade your capability because you'll never be
willing to mark a time as "final." That's where bitemporalism (as mentioned in
that issue I linked previously) comes into play. You can mark a result as
final as of some processing time, and then issue corrections as of some
processing time in the future if some out-of-order data arrives. Materialize
will (likely) gain support for bitemporalism eventually, and the underlying
dataflow engine supports arbitrary-dimension timestamps already.

Would be happy to chat about this more, if you're curious, but I feel like
this discussion is getting a bit unwieldy for an HN thread! (At the very least
I might need to put you in touch with Frank.) Feel free to reach out on GitHub
[1] or our Gitter [2], or shoot me an email at benesch@materialize.io.

[0]:
[https://docs.rs/timely/0.11.1/timely/dataflow/operators/struct.Capability.html](https://docs.rs/timely/0.11.1/timely/dataflow/operators/struct.Capability.html)

[1]:
[https://github.com/MaterializeInc/materialize/issues](https://github.com/MaterializeInc/materialize/issues)

[2]:
[https://gitter.im/MaterializeInc/community](https://gitter.im/MaterializeInc/community)

------
chrisjc
Would it be fair to say this is a more OLAP-oriented approach to what KSqlDB
(not KSql, but [https://ksqldb.io/](https://ksqldb.io/)) does?

Seems that it's perhaps lacking the richness of how ksqldb uses Kafka
Connectors (sinks and sources), but I don't see any reason you couldn't use
Materialize in conjunction with ksqldb.

Eg:

KC-source --> ksql --> materialize --> kafka --> KC-sink

Question to Materialize...

What connectors (sinks and sources) do you have or plan to develop? Seems like
it's mostly Kafka in and out at the moment.

Why would I use this over KSqlDB?

Can I snapshot and resume from the stream? Or do I need to rehydrate to
re-establish state?

~~~
benesch
> Would it be fair to say this is a more OLAP-oriented approach to what KSqlDB
> (not KSql, but [https://ksqldb.io/](https://ksqldb.io/)) does?

I'm not sure I'd say it's "more OLAP." ksqlDB is about as OLAP as it gets,
considering it doesn't support any sort of transactions or consistency. We
think Materialize is quite a bit more powerful than what ksqlDB offers, thanks
to the underlying technologies (timely/differential). For example, our joins
are proper SQL joins, and don't require you to reason about the complicated
and confusing difference between a stream and a table
([https://docs.ksqldb.io/en/latest/developer-guide/joins/join-streams-and-tables/](https://docs.ksqldb.io/en/latest/developer-guide/joins/join-streams-and-tables/)).
We also have preliminary support for maintaining the consistency properties of
upstream OLTP data sources, and we'll be rolling out a more complete story
here shortly.

> Seems that it's perhaps lacking the richness of how ksqldb uses Kafka
> Connectors (sinks and sources), but I don't see any reason you couldn't use
> Materialize in conjunction with ksqldb.

Is there something in particular about ksqlDB connectors that we don't seem to
support? Our CREATE SOURCE command is quite powerful:
[https://materialize.io/docs/sql/create-source/](https://materialize.io/docs/sql/create-source/).
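
For example, ingesting a Debezium change stream from Kafka looks roughly like
this (broker, topic, and registry addresses are made up; see the linked docs
for the exact syntax):

    -- Ingest a Debezium-formatted change stream from Kafka.
    CREATE SOURCE orders
    FROM KAFKA BROKER 'kafka:9092' TOPIC 'db.public.orders'
    FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://registry:8081'
    ENVELOPE DEBEZIUM;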

> What connectors (sinks and sources) do you have or plan to develop? Seems
> like it's mostly Kafka in and out at the moment.

We already support file sources in a variety of formats, and support for
Amazon Kinesis is on the short-term roadmap:
[https://github.com/MaterializeInc/materialize/issues/1239](https://github.com/MaterializeInc/materialize/issues/1239).

Are there other connector types you'd like to see?

> Can I snapshot and resume from the stream? Or do I need to rehydrate to
> re-establish state?

At the moment you can't snapshot and resume, but support for this is planned.

~~~
chrisjc
Thanks for the detailed response. At this point I think the onus is on me to
go and take a deeper look into timely/differential.

Important "sources" to me are obviously Kafka, but also MySQL and Mongo.
Important "sinks" would be Snowflake (maybe through S3, or directly through
PUT) and Elasticsearch. Although I imagine you might soon be telling me that
you don't need a data warehouse once you have Materialize :)

------
irfansharif
(Linked in the post but) github repo:
[https://github.com/MaterializeInc/materialize](https://github.com/MaterializeInc/materialize)

------
samuell
For anyone interested in the details behind all of this, you should check out
Frank's blog:

[https://github.com/frankmcsherry/blog](https://github.com/frankmcsherry/blog)

------
drej
I really like the pg protocol (like e.g. Cockroach); it lets me use my usual
tools. There are a few things I noticed:

1. It has fairly rich support for types - these new-ish SQL engines often lack
quite a lot of things, but this seems pretty decent.

2. I don't see any comparisons to KSQL, which seems to be the primary
competitor.

3. Read the license. Read it carefully. It has a weird "will become open
source in four years" clause, so keep that in mind. It also disallows hosting
it for clients to use (essentially as a DBaaS).

~~~
jamii
The main difference vs KSQL is that we support standard SQL (roughly SQL92 +
correlated subqueries + json so far) and provide strong consistency:

KSQL has a distinction between streams and tables, effectively giving you
control over how views are materialized but also forcing you to do that work
yourself. In Materialize you just write the same SQL that you would for a
batch job and the planner figures out how to transform it into a streaming
dataflow.
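
Concretely, with a hypothetical schema, the same multi-way join you'd run as a
batch query just becomes a view:

    -- Plain SQL; the planner turns this into a streaming dataflow
    -- and maintains the result as the inputs change.
    CREATE MATERIALIZED VIEW revenue_by_region AS
        SELECT c.region, sum(li.price * li.quantity) AS revenue
        FROM customers c
        JOIN orders o ON o.customer_id = c.id
        JOIN line_items li ON li.order_id = o.id
        GROUP BY c.region;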

KSQL is also eventually consistent - late arriving records might cause changes
to previously emitted answers. Materialize won't give you an answer for a
given timestamp until it can guarantee that the answer is correct and won't be
retroactively updated later.

Expect to see some performance comparisons soon too.

------
kiwicopple
For anyone that might be considering trying something similar with their own
Postgres database (PG10+), we recently opensourced this:
[https://github.com/supabase/realtime](https://github.com/supabase/realtime)

It's an Elixir (Phoenix) server that listens to PostgreSQL's native
replication, transforms it into JSON, then blasts it over websockets.

I see that Materialize is using Debezium, which will give you a similar
result, just with connectors to Kafka, etc.

~~~
thom
The magic here isn't that it's giving you streaming updates of a database,
it's that it's making some of those updates ridiculously fast. The point is
that if you have a materialized view that takes a long time to update (or just
a stack of views that are slow to return results), you now have results
instantly whenever changes come in. If you built a bunch of code to work
around this and do clever invalidation of cached data, you get to throw all
that away. The input/output into the system is less of an issue overall.

------
gaogao
The "you may not cluster any server instances of the Licensed Work together
for one use" in the license is a fairly tricky clause. Under this clause, how
would one run a fault-tolerant instance of Materialize?

~~~
benesch
You would need to contact us to buy an enterprise license. (Or wait four years
for the license to convert to Apache 2.)

------
streblo
How does materialize compare in performance (especially ingress/egress
latency) to other OLAP systems like Druid or ClickHouse? Would love to see
some benchmarks.

~~~
jamii
We're working on benchmarks, but I expect that at the moment Materialize will
be slower for one-off queries but faster for frequently repeated queries.

------
solidangle
> Blazing fast results

I highly doubt this, given that the query engine is interpreted and
non-vectorized. Without compilation or vectorization, queries are 10x to 100x
slower for simple queries, and 100x to 1000x slower for queries with large
aggregations and joins.

> Full SQL Exploration

Except for window functions it seems. These actually matter to data analysts.

~~~
benesch
Considerations are completely different in a streaming context. It’s not so
much about how fast you can churn through terabytes of data; it’s more about
how quickly you can turn around the incremental computation with each new
datum. There’s some serious research behind this product, in timely and
differential dataflow, and I’d encourage you to check out some of that
research before making sweeping performance claims. Frank’s blog post on TPC-H
is a good place to start:
[https://github.com/frankmcsherry/blog/blob/master/posts/2017-04-24.md](https://github.com/frankmcsherry/blog/blob/master/posts/2017-04-24.md)

We definitely have some performance engineering work to do in Materialize, but
don’t let the lack of vectorization scare you off. It’s just not as important
for a streaming engine.

------
1290cc
Pretty cool tech, although I feel they may have missed the moment, as AWS,
Azure, and GCP are becoming hypercompetitive in solving all things related to
data/storage. Azure has been churning out major updates to its services,
clearly taking inspiration from companies like Snowflake. AWS, I think,
hesitated to compete with Snowflake since Snowflake was running on AWS anyway
- a win/win for them.

Snowflake had incredible timing: they hit the market just before CFOs and
non-tech business leaders realized that the cost and talent needed to pull off
a data lake successfully was more than they'd like. Those that were sick of
the management jumped to Snowflake fast, and AWS/Azure never really responded
until recently.

Awesome to see all the innovative takes on solving these extremely technical
problems! I love it!

~~~
manigandham
Google's data product suite seems to be the most advanced today with a
completely no-ops approach and very solid primitives to work with.

Snowflake still doesn't have an answer to streaming data (other than their
fragile Kafka connector), and Azure Synapse still isn't publicly available
even months after its announcement.

AWS had a good head start, but they keep piling on so many products and
features that it's now a big mess and requires yet another tool (AWS Lake
Formation) just to wire it all up.

------
yayr
I am curious about the physical storage. Is it purely in-memory, or is disk
persistence possible? Is there some kind of data compression applied, and what
are the memory needs? Is the data persistence pattern row-based or
column-based?

------
cbdumas
I'm not sure if it's a mistake or just some cheeky humor but the "Known
Limitations" link on their Docs page returns a 404.

~~~
frankmcsherry
It looks like a caching bug somewhere in the web pipeline. It isn't running on
Materialize, sadly.

There are a pile of known limitations, and our goal is to be 100% clear about
them.

~~~
polskibus
Hi Frank! Congrats on the progress - I remember listening to you describing
Materialize at SIGMOD'19 in the big hall - it sounded like magic back then,
glad to see it come true :) I have a question now - can you subscribe to view
updates somehow, instead of querying them on an ad-hoc basis?

~~~
frankmcsherry
Yes, the term we're using is "Sink", dual to a Source.

    https://materialize.io/docs/sql/create-sink/

At the moment you can create Kafka sinks, and we'll write the change log out
to that topic. They are still baking though, and need some hardening for
failure cases.
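
A sketch of what that looks like (names made up; see the linked docs for the
exact syntax):

    -- Stream the view's change log out to a Kafka topic.
    CREATE SINK feed_sink
    FROM some_materialized_view
    INTO KAFKA BROKER 'kafka:9092' TOPIC 'view-updates'
    FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://registry:8081';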

~~~
polskibus
Thanks. What happens when a DDL statement is run on the source table (adding
or removing a column, for instance)? Does it break the entire flow?

~~~
frankmcsherry
DDLs that alter schemas definitely cause things to catch on fire at the moment
(Debezium writes them to a side-topic, which isn't obviously synchronized with
the data topics). In principle `CREATE` and `DROP` statements are things we
could pass through, but some work needs to get done to pull the right
information out of Debezium at the right moments (and drop the right prefixes
of the existing Kafka topics if you re-create a table with the same name).

~~~
jacques_chester
> _Debezium writes them to a side-topic, which isn't obviously synchronized
> with the data topics_

That's unfortunate. It seems like it would be ideal to feed schema changes
through the same topic, so that the ordering is defined without the usual
distributed-system hilarity.

~~~
arjunnarayan
Agreed! We plan to contribute to the Debezium project along these lines over
the coming months. We'll have more to say on this shortly!

~~~
gunnarmorling
Nice, looking forward to your contributions. Let me know if you'd like to
discuss anything upfront. On a general note, have you considered relying on
the schema of the change events themselves (inline JSON schema, or Avro schema
in a registry)? Not all connectors provide the separate schema history topic;
that way you won't have to parse DDL, and the synchronization issue is avoided
altogether. Happy to discuss in more detail.

Disclaimer: I'm the Debezium lead.

------
manigandham
Congrats on the launch, always nice to see new products.

This is an interesting mix of the (now obsolete) PipelineDB, TimescaleDB with
continuous aggregates, Kafka and other message systems with
KSQL/ksqlDB/KarelDB, stream processing engines like Spark, and typical RDBMSes
like SQL Server with materialized views.

The amount of research to support complex and layered queries definitely sets
this apart.

------
mason55
Not sure how the feature sets compare, but AWS is releasing materialized views
for Redshift sometime soon, and one of the things it will support is
incremental refresh (assuming your view meets some criteria).

I'm sure Materialize is better at this since it's purpose-built, but if you're
on Redshift you can get at least some of the benefits of incremental
materialization.

~~~
chrisjc
It's been a while since I've used Redshift, but isn't it still dependent on
data coming in via a COPY from S3? Any sort of Redshift materialized view
offering would depend on batches of data landing in an underlying table or
tables. The closest AWS service offering is probably Kinesis Analytics (or
Flink on KA), using their flavor of streaming SQL to join Kinesis streams into
new ones.

~~~
mason55
With the introduction of Spectrum you can back a Redshift table with data in
S3 directly. I'm not sure how that interacts with the materialized views
though. Probably not supported yet but I would expect it to be eventually.

------
wiradikusuma
Materialize connects directly to event stream processors (like Kafka) --- how
about Pulsar? (Googling doesn't yield anything useful; Materialize and Pulsar
are both names of other brands and products.)

~~~
chris_overseas
Connecting Pulsar and Materialize is of interest to me too and something I
might try when I find time to do so. Note that Pulsar does have a Kafka
compatibility layer already[0], so it might just work out of the box. If you
try this I'd be keen to hear how it goes.

[0] [https://pulsar.apache.org/docs/en/adaptors-kafka/](https://pulsar.apache.org/docs/en/adaptors-kafka/)

EDIT: I don't think this adaptor will work after all; it works by replacing
the Kafka Java client library with its own, so it is only applicable to Java
applications.

~~~
jamii
Adding sources is pretty easy. We just need them to provide:

* timestamped changes eg "record foo was deleted at time 42"

* watermarks eg "I've now sent you all the changes up to time 42"

Here's the implementation of the Kafka source:
[https://github.com/MaterializeInc/materialize/blob/master/src/dataflow/source/kafka.rs](https://github.com/MaterializeInc/materialize/blob/master/src/dataflow/source/kafka.rs).
(There's also a bit of code in the parser/planner to wire up the CREATE SOURCE
syntax, but it's fairly trivial.)

------
simo7
I'm wondering how this technology could work for OLAP cubes.

An OLAP cube that is automatically & incrementally kept in sync with the
changes in the source data sounds promising.

Is that a potential use case?
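
Presumably such a cube would just be an ordinary multi-dimensional aggregate
view, kept in sync as source rows change - a rough sketch against a
hypothetical sales table:

    -- One aggregate per (region, product, month) cell.
    CREATE MATERIALIZED VIEW sales_cube AS
        SELECT region, product,
               date_trunc('month', sold_at) AS month,
               sum(amount) AS total_amount,
               count(*) AS num_sales
        FROM sales
        GROUP BY region, product, date_trunc('month', sold_at);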

------
polskibus
Looking back at the project, knowing what you know now, if you were to start
again (but without the Rust skills you've since acquired), would you go with
Rust again or pick another toolbox?

~~~
jamii
> would you go with rust again

Absolutely. Even aside from safety, rust has so many quality of life
improvements over C++ - algebraic data types, pattern matching, checked errors
instead of exceptions, sane standard library, easy packaging, readable error
messages, #[must_use], immutable utf8 strings, traits instead of templates,
hygienic macros etc.

Other than compile times and editor tooling, which are being actively worked
on, the only real pain point I can think of is the poor support for custom
allocators.

As for other alternatives, I don't personally have strong opinions on go but I
think it's notable that none of the ex-cockroachdb folks at materialize
suggested using it instead of rust.

------
mnkmnk
How does this compare to Spark structured streaming? That too allows writing
SQL on streams and having the state update incrementally.

------
gbrits
Is this similar to TimescaleDB's Continuous Aggregates? Interested in knowing
the overlap / differences.

~~~
benesch
Similar in concept, but much more powerful in execution. We can incrementally
materialize practically any SQL 92 query, with the killer features being joins
and correlated subqueries. I’m not super familiar, but TimescaleDB’s
continuous aggregates (just on a cursory glance) don’t support stacked
aggregates, for example:
[https://github.com/timescale/timescaledb/issues/1400](https://github.com/timescale/timescaledb/issues/1400)
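
By a stacked aggregate, think of an aggregate over an aggregate (hypothetical
schema):

    -- First aggregate per user, then aggregate the aggregates.
    CREATE MATERIALIZED VIEW orders_per_user AS
        SELECT user_id, count(*) AS n
        FROM orders
        GROUP BY user_id;

    CREATE MATERIALIZED VIEW avg_orders_per_user AS
        SELECT avg(n) AS avg_orders
        FROM orders_per_user;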

------
rl3
Their careers page says it's built in Rust.

That's always nice to see, since Rust jobs are somewhat rare.

------
gbrits
Looks promising! Can materialized views be backfilled?

~~~
vhold
According to the documentation at
[https://materialize.io/docs/overview/architecture/](https://materialize.io/docs/overview/architecture/)

> _Streaming sources must receive all of their data from the stream itself;
> there is no way to “seed” a streaming source with static data. However, you
> can union streaming and file sources in views, which accomplishes a similar
> outcome._

What I find odd is that the documentation doesn't show UNION support in the
SELECT documentation:
[https://materialize.io/docs/sql/select/](https://materialize.io/docs/sql/select/)

Perhaps just an oversight in the documentation?

~~~
benesch
Good catch, thanks! I've filed an issue about this:
[https://github.com/MaterializeInc/materialize/issues/2056](https://github.com/MaterializeInc/materialize/issues/2056)

Though it's not yet documented, we do support all of the SQL set operations
[0]: UNION [ALL], INTERSECT [ALL], and EXCEPT [ALL].

[0]: [https://www.postgresql.org/docs/current/queries-union.html](https://www.postgresql.org/docs/current/queries-union.html)
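
So the backfill pattern quoted above from the architecture docs would look
something like this (hypothetical source names):

    -- Union a static file source (historical data) with a
    -- streaming source (live data) to "seed" a view.
    CREATE MATERIALIZED VIEW all_events AS
        SELECT * FROM historical_events_file
        UNION ALL
        SELECT * FROM live_events_stream;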

------
sashavingardt2
I'm intrigued by the product. The SQL examples on the product page are
atrocious though ((((

------
programmarchy
Isn't BigQuery real-time as well? For me, the wow factor is that you can host
this yourself.

~~~
chrisjc
I'm unfamiliar with BQ, but from what I understand BQ doesn't have any concept
of streams... with maybe an exception for change-streams (change tracking),
and that's very different and less powerful than what Materialize is able to
do.

After all, in BQ you're only able to query what has already been ingested or
is being ingested, and that is somewhat by definition not 'real-time'.

~~~
manigandham
GCP has Pub/Sub, which is their streaming/message bus product. Cloud Dataflow
is their stream/batch processing engine (backed by Apache Beam).

There's a new feature called "Dataflow SQL" that lets you submit a declarative
SQL query through the BQ UI as a job to Dataflow which can join both BigQuery
tables and PubSub topics in the same job.

[https://cloud.google.com/dataflow/docs/guides/sql/dataflow-sql-intro](https://cloud.google.com/dataflow/docs/guides/sql/dataflow-sql-intro)

~~~
chrisjc
Thank you for the correction. Sounds like a very slick solution bringing batch
and streaming (the B and EAM obviously) together.

------
tschmidleithner
See also
[https://news.ycombinator.com/item?id=22346915](https://news.ycombinator.com/item?id=22346915)

~~~
dang
This comment was briefly downvoted and flagged because the convention on HN is
only to link to past threads that have comments. Otherwise users click on the
link, don't find the discussion they expected, and get ornery.

Since you're new and likely didn't know this, I've restored your post. (And
welcome!)

------
justlexi93
ClickHouse has materialized views and is free.

~~~
vhold
I think the biggest difference is that Materialize can do any kind of SQL join
on many tables at once. Clickhouse materialized views can only reference one
table.

What I'd like to know is if that would enable basically implementing social
networks as just 3 tables and one materialized view, and how it would scale
and perform.

Users, Relationships, Posts, and a Feed materialized view that simply joins
them together, with an index on user_id and post_timestamp.

As relationships and messages are created or deleted, the feed view is nearly
instantly updated. The entire view service logic is then just one really fast
query:

    SELECT user, post, post_timestamp
    FROM feed
    WHERE user_id = current_user
      AND post_timestamp <= last_page_post_timestamp
    ORDER BY post_timestamp DESC
    LIMIT page_size
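
For concreteness, the feed view itself might be sketched like this
(hypothetical schema):

    -- Each user sees the posts of the users they follow.
    CREATE MATERIALIZED VIEW feed AS
        SELECT r.follower_id AS user_id,
               u.name AS author,
               p.body AS post,
               p.created_at AS post_timestamp
        FROM relationships r
        JOIN posts p ON p.author_id = r.followee_id
        JOIN users u ON u.id = p.author_id;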

~~~
wrmsr
In the real world it is inevitably a lot more than just those 3 tables - add
'groups', different types of groups, different privacy settings, different
per-user feed preferences, experiments, and any number of other things which
_can_ be expressed as pure, normalized, joined tables in a matview but make
naive approaches a lot less likely to actually work in prod.

In my experience the most successful approach to this is a midpoint - you
materialize/denormalize enough to feed your app endpoints and search engines
but retain flexibility in searching those fat but instantly available docs,
and relatedly you also don't always need to preemptively materialize
absolutely everything in any particular view - see
[https://engineering.fb.com/data-infrastructure/dragon-a-distributed-graph-query-engine/](https://engineering.fb.com/data-infrastructure/dragon-a-distributed-graph-query-engine/).
Without being able to transparently operate on arbitrarily partially populated
matviews, you are locked into a self-defeating all-or-nothing system that is
likely to culturally do more harm than good with its rigidity. Imagine, for
example, if there were no 'caches', just a binary choice of precomputing
everything ahead of time or recomputing everything every time. Neither extreme
is sufficient for all cases, and real applications are composed of many
different points on that spectrum.

~~~
vhold
That's fair. It will be interesting to see what people do along those lines,
creating various materialized views, joining them at query time, chaining
materialized views, and I think most important to your point, creating new
kinds of sinks for the updates.

Right now Kafka is the only sink
([https://materialize.io/docs/sql/create-sink/](https://materialize.io/docs/sql/create-sink/)),
but because it integrates with Confluent's schema registry, I'm guessing it
should work well with many of Confluent Connect's sinks
([https://docs.confluent.io/current/connect/managing/connectors.html#connect-bundled-connectors](https://docs.confluent.io/current/connect/managing/connectors.html#connect-bundled-connectors)).
Elasticsearch would be an especially useful sink connector, I think, to your
point.

I haven't used any of these things together, so right now I'm totally
speculating on the potential.

What I'm mostly envisioning is that there are a lot of smaller-scale
applications where the complexity of adding an activity feed just isn't worth
it. But if you could implement a feature like that trivially, it could be
game-changing.

