Readyset: A MySQL and Postgres wire-compatible caching layer (github.com/readysettech)
164 points by lsferreira42 9 months ago | 70 comments



In the Microsoft SQL Server space, several of these vendors have come and gone. My clients have been burned badly by 'em, so a few quick lessons learned:

Be aware that there are hundreds of open issues [0] and dozens of pull requests [1], some of which involve clients being unable to connect or not supporting all components of the SQL language. Just because your database supports something doesn't mean your caching layer will.

It gets really ugly when a new version of your database comes out, with brand new features and language enhancements, and the caching layer doesn't support it. It may take months, or in some cases years, before the caching layer is feature-complete with the underlying database. If you want to use some of those language enhancements, then your app may have to maintain two connection strings - one for the caching layer, and one for direct database queries that the caching layer doesn't support.

Your support teams need to learn how to diagnose problems with the caching layer. For example, this issue [2] talks about the level of work involved in understanding why newly inserted data isn't showing up in selects.

I hope they succeed and deliver the concept, because it's one of the holy grails of databases.

[0]: https://github.com/readysettech/readyset/issues [1]: https://github.com/readysettech/readyset/pulls [2]: https://github.com/readysettech/readyset/issues/39


I don't think it's fair to hold the number of open issues and pull requests against them. Looking through them for a minute, it looks like 95%+ are from their own team members, with a good chunk of the issues being "low priority" issues. So you are just seeing the typical ever-growing backlog that is normally in a private JIRA instance.

Having said that, the way they work with pull requests is unlike anything else I've seen. I see that they are using a merge bot, but apart from that all branch names are completely illegible. As a lot of the team seems to be present in this thread, it would be interesting to get some details about that.


We use Gerrit internally and use GitHub as a pseudo-mirror. We allow (and welcome!) community contributions to GitHub via PRs.

Our internal Changelists in Gerrit get synced to GitHub as those PRs you are seeing (the branch names correspond to Gerrit change-ids), and any community PRs get copied to our Gerrit first before going through CI and then being merged and synced with GitHub.


They use the database's replication API. In theory, it should avoid situations where the cache doesn't understand query syntax.

But, what I'd worry about are situations where an application needs to cache an object that's expensive to build. (And perhaps expensive to build because the programmer doesn't understand how to optimize SQL.)


Regarding the new database version issue, I wonder why the caching layer can't just pass any query it is unable to process on to the underlying database?

This would be more complex if the feature you are using does not return a normal table of results (e.g. the pub/sub support in Postgres).


> I wonder why the caching layer can't just pass any query it is unable to process on to the underlying database?

Because sometimes it's not about the query you pass in - it's about a new data type, for example. The layer may think it understands the query, but may not be able to handle the format of the results, or may cache them incorrectly.

For example, when SQL Server added the JSON data type, one of my customers found that the caching layer mangled it for some clients, but not others, or would return the correct result the first time, but the cached result was formatted incorrectly.

Another problem can be the connection itself, like when SQL Server added support for Always Encrypted, an end-to-end encryption method.


> I wonder why the caching layer can't just pass any query it is unable to process on to the underlying database?

A caching layer can do that, and I'd be surprised if serious ones don't. It might not be perfect when moving between DB versions, though: the new version might introduce syntax that is mistaken for something the layer _thinks_ it understands. This can be tested for to a large extent by the devs having access to alpha/beta/rc versions, but I can still see things slipping through, and you are relying on people keeping their version of the caching layer as up-to-date as everything else, which is not a given.


That is pretty much what Readyset does. The SQL parser identifies queries that are supported (and cached), and passes through all other requests to the backend database.

As we expand query support, we allow users to cache more and more queries.
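
Roughly what that looks like from the application side, as a minimal sketch in Python with psycopg2 (the DSN, port, and table are invented for illustration, and treat the CREATE CACHE statement as approximate -- check the Readyset docs for the exact DDL):

  import psycopg2

  # Point the app at Readyset instead of the database directly; Readyset
  # speaks the Postgres wire protocol, so the driver doesn't change.
  conn = psycopg2.connect("postgresql://app:secret@readyset-host:5433/appdb")
  conn.autocommit = True
  cur = conn.cursor()

  # Opt one query shape into the cache. Anything unsupported or not cached
  # is proxied straight through to the upstream database.
  cur.execute("CREATE CACHE FROM SELECT id, title FROM posts WHERE author_id = $1")

  # Served from the cache once populated; other queries pass through.
  cur.execute("SELECT id, title FROM posts WHERE author_id = %s", (42,))
  print(cur.fetchall())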


This is one of the deepest deep tech startups I've seen in a long time. I had the pleasure to meet some of the folks at RustConf in Portland.

Readyset is basically reimplementing a full database, at the absolute bleeding edge of db research, enabling global partial replication of any kind of data.

A solution desperately needed, as databases grow.

You can think of it as an intelligent LRU cache in front of your database. An important step towards fast globally distributed applications.

I hope this project will get more publicity and adoption - it's very well deserved.


Pretty sure that this is the database that Jon Gjengset[0] was working on as part of his thesis project. There have been several videos shared by him during talks about the system. It's a really interesting concept.

edit: Here's[1] a video where he talks about the concept

[0]: https://www.youtube.com/@jonhoo [1]: https://www.youtube.com/watch?v=GctxvSPIfr8


Someone knowledgeable might know: is this just incremental view updates? To what extent is the cache intelligent if parameters, where clauses, or aggregates change?

I really love this space and have been impressed with Materialize, but even if you can make some intermediate state incremental, if your workload is largely dynamic you end up needing to jump the whole way to OLAP platforms. I’m hopeful that we’re closer and closer to having our cake and eating it here, and that the operational data warehouse is only round the corner.


They have a bit about their technical foundation here[0].

Given that Readyset was co-founded by Jon Gjengset (but has apparently since departed the company), who authored the paper on Noria[1], I would assume that Readyset is the continuation of that research. I wouldn't call that "just" incremental view maintenance, as it's a lot more involved than the simplest implementation of IVM (though obviously that is the end-goal).

So it shares some roots with Materialize: they have a common conceptual ancestry in Naiad [2], with Materialize having evolved out of timely-dataflow [3].

[0]: https://docs.readyset.io/concepts/streaming-dataflow

[1]: https://jon.thesquareplanet.com/papers/osdi18-noria.pdf

[2]: https://dl.acm.org/doi/10.1145/2517349.2522738

[3]: https://github.com/TimelyDataflow/timely-dataflow


Just poking my head in to say that I technically never departed ReadySet — what happened was that I co-founded the company, but was so burnt-out when it came to databases after my PhD that I decided to leave the running of the company to others. Then, US visa regulations made it so that I couldn't really be involved _at all_ if I wasn't an actual employee, which meant I truly was "just a founder" with no real involvement in the company's execution if you will. Now that I'm back in Europe, that's changing a bit, and I have regular calls with the CEO and such!


Thanks so much for your work! Visa problems are a bane of the US. I hope the situation improves for you in the future.


Wow, uh, Noria got listed a bunch of times looking back in search. https://hn.algolia.com/?q=noria

(It turns into Nokia results part way through page 2, but still a lot of submissions!)


Readyset auto-parameterizes cached queries similar to a prepared statement. If you run the same query with different parameters, it will be routed to the cache. The first time a parameter set is queried, it will be a cache miss and trigger an "upquery" to populate the cache, after which that set of parameters will be served from the cache.

Different where clauses (sets of parameters) would map to different query-caches and currently need to be cached separately.

Aggregates supported by Readyset [1] will also be automatically maintained, but depending on the query, they may be handled by post-processing after retrieving results from a cache.

[1] https://docs.readyset.io/reference/features/queries#aggregat...
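
In application terms, the parameter handling described above looks roughly like this (Python + psycopg2; the table and DSN are made up, and the cache DDL is approximate):

  import psycopg2

  conn = psycopg2.connect("postgresql://app:secret@readyset-host:5433/appdb")
  conn.autocommit = True
  cur = conn.cursor()

  # One cached query shape; the literal value is treated as a parameter,
  # much like a prepared statement. (Approximate DDL.)
  cur.execute("CREATE CACHE FROM SELECT * FROM orders WHERE customer_id = $1")

  # First request for customer 7: cache miss -> upquery populates the cache.
  cur.execute("SELECT * FROM orders WHERE customer_id = %s", (7,))

  # Same parameter again: served from the cache.
  cur.execute("SELECT * FROM orders WHERE customer_id = %s", (7,))

  # Different parameter, same shape: routed to the same cache, but this
  # parameter set gets its own first-miss/upquery before being cached.
  cur.execute("SELECT * FROM orders WHERE customer_id = %s", (8,))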


From a tech perspective, this is really cool. From a use case perspective, could someone help me understand why a developer would adopt something like this over a database like Clickhouse, outside of some fintech use cases where milliseconds of latency really matter? I'd be worried about introducing an additional point of failure to the data stack. And, if this is like Materialize, I'd be worried about this not supporting ad hoc queries -- only precomputed ones.


> From a use case perspective, could someone help me understand why

Imagine a legacy system that has a method that dynamically query-builds a massive query based on several method parameters (say 10 or 20); this method is used in two dozen places or more. The underlying tables are used in a million other places. Rewriting the query-building method or, even worse, changing the underlying data model, would be expensive.

Now imagine that you could speed up some of these queries WITHOUT changing your code or model or rolling your own cache solution (the invalidation of which is always a real PITA). All this basically for free.

I don't think "why a developer would adopt something like this over a database like Clickhouse" is the right take. They do not compete. It's not about "adopting a database"; that decision has typically been made a long time ago, in a galaxy far, far away, and by someone other than you. Unless, of course, you work on green-field projects or projects small enough that "adopting a different database" is even a question. I'd love some of that stuff :)

One of the biggest systems I worked on for several years had close to 700 MySQL tables (yes, tables, not columns); basically anything that was anywhere near the core of the system took ages to change and test. I can't possibly imagine the investment it would require to move that system from MySQL to something else without making a billion bugs along the way. I could imagine using something like Readyset, especially if it handles cache invalidation for you based on underlying data model changes.


> From a use case perspective, could someone help me understand why a developer would adopt something like this over a database like Clickhouse, outside of some fintech use cases where milliseconds of latency really matter?

Concurrency. ClickHouse works best with a relatively small number of concurrent queries: hundreds to low thousands, not 10s of thousands or more. That allows each query to hog more resources and get done quickly.


They have a pretty good writeup on why you’d want to use this here: https://blog.readyset.io/dont-use-kv-stores/


To be honest, I see it more as a write-up about the shortcomings of using a classic cache system in front of your DB rather than about what Readyset does. Yes, it explains it, but it could be better (for example, a practical example of how you set a query to be cached, an overview of how replication stream ingestion works under the hood, etc.).


This sounds like it has heavy overlap with IVM. How does Readyset distinguish itself from existing solutions like pg_ivm or Materialize?


ReadySet descends from Noria (https://www.usenix.org/conference/osdi18/presentation/gjengs...), which I view as the next generation of incremental dataflow technology after Naiad/Differential Dataflow/Materialize.

The key difference is that Noria/ReadySet supports partial materialization and can reconstruct data on-demand, whereas Naiad/Differential Dataflow/Materialize must keep a complete materialization up to date. ReadySet can partially evict parts of the data flow and bring them back later if needed.

In practical terms, if you have a materialized view you want to maintain, in ReadySet you pay O(part of the materialized data flow you actually need), which is less than the O(entire materialized view) you'd pay with Materialize.


(Materialize CTO here.)

Partial materialization is indeed Noria's major contribution to dataflow technology, and it's impressive stuff. But I want to call out that there are a number of techniques that folks use with Materialize to avoid paying for O(entire materialized view). The two most common techniques are demand-driven queries using lateral joins [0] and temporal filters [1]. Noria's approach to partial materialization is automatic but gives the user less explicit control; Materialize's approach is manual, but gives the user more explicit control.

The other major divergence between Materialize and Noria is around consistency. Noria is eventually consistent, while Materialize is strongly consistent. There is a caveat to Materialize's consistency guarantees today: we don't offer strong consistency across your upstream {Kafka, PostgreSQL, MySQL} and Materialize. You only get strong consistency within Materialize itself. But we've got an improvement for that in the works that'll be rolling out in the next few months.

[0]: https://materialize.com/blog/lateral-joins-and-demand-driven...

[1]: https://materialize.com/docs/transform-data/patterns/tempora...
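
For anyone who wants to see the shape of the temporal-filter technique, here's a rough sketch (Materialize is Postgres wire-compatible, so plain psycopg2 works; the schema and DSN are invented and the SQL is paraphrased from the linked docs, so treat it as approximate):

  import psycopg2

  conn = psycopg2.connect("postgresql://materialize@materialize-host:6875/materialize")
  conn.autocommit = True
  cur = conn.cursor()

  # A temporal filter bounds how much history the view keeps: rows whose
  # timestamp falls outside the window drop out of the materialization
  # automatically, so you don't pay for the entire history.
  cur.execute("""
      CREATE MATERIALIZED VIEW recent_events AS
      SELECT * FROM events
      WHERE mz_now() <= event_ts + INTERVAL '30 days'
  """)

  cur.execute("SELECT count(*) FROM recent_events")
  print(cur.fetchone())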


Thanks for the great reply, I didn’t know about lateral join.


One of the things we love about using JPA (we use EclipseLink) is that it comes with caching, for free, and it's transparent. You can mark any field as a cache index and it automatically tries the cache first. Updates are published and loaded into every node's cache automatically, and you get fallback protection in the form of incremental version numbers on rows.

The one thing it can’t handle however is range update queries or native queries that perform updates.

You can just avoid them in your architecture… OR maybe this is the solution we've been looking for! Definitely going to give this a spin!

Documentation looks very complete, and I like that there's a UI to view the query cache.


I don't understand what the use case is for this.

If I have a front end, I would hope that the formatted response is what we're caching. Be that HTML or JSON.

If I can't read from that cache then I should be reading from fresh data altogether? Right?


Ever add (or inherit) the server-side part of an application that uses Redis or Memcache? The data is denormalized: When you do an update/insert/delete into the SQL side of things, you need to do a corresponding change in Redis / Memcache. All of your queries end up being something like: Try Redis/Memcache, if the data isn't present, query the database and insert into Redis/Memcache.

It (Redis/Memcache caching) adds a huge amount of complexity to your application, and the risk of defects is very, very real.

ReadySet basically gives you a magic "stick this thing between your database and application and we'll do the caching for you." It totally eliminates a time consuming and error-prone part of your application.

If you want to, they go into details here: https://blog.readyset.io/dont-use-kv-stores/
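
For anyone who hasn't lived it, the cache-aside dance being described looks something like this (Python with redis-py and psycopg2; the schema, keys, and TTL are made up) -- and every write path in the application has to remember to do the invalidation half:

  import json
  import psycopg2
  import redis

  r = redis.Redis(host="cache-host", port=6379)
  pg = psycopg2.connect("postgresql://app:secret@db-host:5432/appdb")
  pg.autocommit = True

  def get_post(post_id):
      # Read path: try the cache, fall back to the database, backfill the cache.
      key = f"post:{post_id}"
      cached = r.get(key)
      if cached is not None:
          return json.loads(cached)
      with pg.cursor() as cur:
          cur.execute("SELECT id, title, body FROM posts WHERE id = %s", (post_id,))
          row = cur.fetchone()
      r.set(key, json.dumps(row), ex=300)
      return row

  def update_post(post_id, title, body):
      # Write path: every code path that touches posts must also remember to
      # invalidate (or rewrite) the matching cache entries, including any
      # denormalized ones (lists, counts, per-user views, ...).
      with pg.cursor() as cur:
          cur.execute("UPDATE posts SET title = %s, body = %s WHERE id = %s",
                      (title, body, post_id))
      r.delete(f"post:{post_id}")  # forget this and you serve stale data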


You are reading from "fresh" (though only eventually consistent) data. For many public facing queries that's enough.

After an initial query seen by their proxy, you can configure all future queries of the same kind to be pre-computed.

So I think it makes more sense to think of it like an auto-updating materialized view available with the click of a button rather than a cache.

There are some more under-the-hood details here: https://docs.readyset.io/concepts/overview#how-does-readyset...


Normal request: Request -> work -> work.. -> query/db

Effective Cache: Request -> (less) work -> cache

This Product: Request -> work -> work.. -> query/cache

I understand the concept of caching at a boundary layer. I fail to see the point of a cache at THIS boundary layer. You have all the problems of a cache with fewer benefits (you're not going to fix a thundering herd at this level).


> You have all the problems of a cache with fewer benefits

No you don't. You have next to none of the problems of a cache, like cache invalidation, etc. (especially as you have to explicitly opt individual queries in), with all the benefits. It's about as free as performance benefits can be (from an implementation standpoint).

I understand that for many use-cases caching at the response level may be preferable, but there are also many use-cases that are read-heavy but also involve a lot of computed values that are updated and have to be recomputed regularly, where this dataflow-based approach has been shown to be one of the least compute-intensive and most efficient solutions.


>> I understand that for many use-cases caching at the response level may be preferable

Yes after all the data + work/compute

>> there are also many use-cases that are read-heavy but also involve a lot of computed values that are updated and have to be recomputed regularly, where this dataflow-based approach has been shown to be one of the least compute intensive and efficient solutions.

From a comp sci, from a programming, from an engineering perspective I get this. But at the end of the day those are just the hammers and nails of the business. I am wondering what actual, in-production use case with a real business need requires this. I can think of a dozen technical fuck-ups where this solution is appealing, but that's just stacking irresponsibly...


Sometimes, you have to deal with closed-source and legacy software for which there is no sufficient support in the application. This proxy can help boost the performance of such applications.


How do you invalidate your cache? Is it complicated? What about search queries with many filters? What about responses that are unique per user? What about common queries that are shared between multiple views, do you constantly re-run those as your response cache is invalidated?


Caches are never invalidated. Readyset uses CDC to receive updates from PostgreSQL/MySQL and update the cache entries. No invalidation required. The price you pay is eventually consistent data, which is already the case if you use any async replication, as Readyset does.


That's how ReadySet works; I'm asking the grandparent (rhetorically) how their own method of caching works for them.


What if it's a backend service that is not exposed to a frontend that receives HTML or JSON? E.g. a task that takes an id and an operation name and needs to fetch a bunch of data from the database to process.


You would add Readyset between your backend and your database in order to cache the data you fetch from the DB.


I'm answering the comment above mine that is asking for a use case. I'm saying that my use case is an example where JSON/HTML caching is not sufficient. 100% agree with you.


This isn’t an open source project (which isn’t a bad thing! Just a non-obvious thing if you don’t scroll down their whole readme)


The project is an open-source project and released under a BSL license. We accept community contributions, and you have the option to build from source and self-host the solution as well.

We released a cloud-managed solution that we will charge for: https://readyset.cloud/


BSL is not an Open Source license. The fact that outdated code eventually becomes Open Source does not make it an Open Source license.

It is great that ReadySet is source-available and free to use in a number of circumstances, but I wish you would avoid misleadingly marketing it as Open Source.


I apologize. I unintentionally misspoke. You are correct that BSL is a Source Available license as opposed to an Open Source license.

I don't believe the website or the Github page makes that claim.


On top of that, quoting @martypitt's reply:

> Most commonly the restrictions prevent you from launching a competing offering. In their case, you can't offer database-as-a-service using their code.

Meaning the self-hosted version is free to use on any number of servers, keeping in mind the competing-offering restriction.


Many people's biggest issue with BSL is that there are dozens of versions of the "additional use grant" which each have bespoke language with very critical clauses, none of which have any case law behind them (correct me if I'm wrong).

Even though software may be licensed under "BSL" it isn't really a standard, even though proponents tend to use the term "BSL" as if it is the same as talking about "GPL" or "BSD". The MariaDB BSL used here is quite different than the HashiCorp BSL, for instance.


In this case the additional use grant is:

  Additional Use Grant: You may make use of the Licensed Work, provided that
  you may not use the Licensed Work for a Database Service. A ‘Database
  Service’ is a commercial offering that allows third parties (other than
  your employees and contractors) to access the functionality of the Licensed Work.
IANAL but to me that sounds like it could be interpreted as almost any commercial work based on it since some part will "access the functionality", indirectly or directly. I know that is not what they intended, but the language is loose enough to allow almost any interpretation.


Trying to add transparent caching to a transactional database is just a bad idea and cannot work. Anyone who says it works for them is just in the period after putting it in place and before they realize why it cannot work.

If it was possible to just slap a cache in between you and the db and magically make shit fast, DB vendors would have done that 20 years ago. Billions of dollars a year is put into relational db development. Papers are published every week, from theoretical ways to model and interact with data to practical things like optimizing query execution plans.

Unless Readyset can point to a patent or a paper that has fundamentally revolutionized how database will be built from today forward it is going to be crap and will burn you.


This might be good tech and a good company.

At a startup, we once used a distributed caching system that was open source. Then features we needed were cut from the open source version, so we bought a license. Then the startup was bought by a large software company and the license costs went 10x YoY with one week's notice. As our migration away from this tech was not done, because it was very complicated and tied into our application, we had to pay. Luckily we had also been bought, and the very large costs were not a problem. I would never again use something that is crucial to our operations from a company like that.


That is a good point on the application changes. What is appealing about Readyset is that it does not require you to change your application code. You can just change your database connection string to point to it and it will start to proxy your queries to your database. From there you can choose what you want to cache, and everything else (writes, unsupported queries, non-cached read queries) will be automatically proxied to your database.


But the read side is already fairly trivial to scale with read replicas


Some queries might be too slow at p95 even on a read replica with no other clients. Those kinds of queries can benefit greatly from a materialized view, and incremental view maintenance as data changes.


Not denying your point, but in some applications (which aren't outliers) a good cache can increase the number of requests served 100x. Achieving that with read replicas is possible but can be quite expensive.

A read replica doesn't give you perfect transactional guarantees either. Read operations on the replica after a write operation to the leader might still give you stale data (lagging by a few milliseconds).


Yes to a degree, but not if you require/value consistency.


Well Postgres has sync replication if you value consistency


Love the work. Looks quite similar to what PlanetScale Boost does[1]. Basically the same but as a front-end to someone's existing database? (disclaimer: I work at PS).

[1]: https://planetscale.com/blog/how-planetscale-boost-serves-yo...


They're both based on the techniques outlined in the Noria paper (https://www.usenix.org/conference/osdi18/presentation/gjengs...) and my thesis (https://jon.thesquareplanet.com/papers/phd-thesis.pdf), so not terribly surprising they carry some resemblance :p


I imagine a good use case for this at its current stage would be powering a monitoring dashboard that runs ad-hoc queries against your operational DB. I've seen this situation at a previous fintech company I worked for, where we had some people staring at dashboards all day long looking for issues in any of the subsystems.


I just wanted to give a high five for having Jepsen tests for this: https://github.com/readysettech/readyset/tree/stable-240117/...


Slight tangent, but this reminds me of discussions I've seen on the Postgres mailing lists about native support for real-time materialized views. Does anyone know if we can expect to see something like this in a future version of Postgres?


I don't know about support in core, but there is this extension: https://github.com/sraoss/pg_ivm
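
Roughly what using it looks like (Python + psycopg2; the table is invented, and I'm writing the create_immv call from memory of the pg_ivm README, so double-check the exact signature there):

  import psycopg2

  conn = psycopg2.connect("postgresql://app:secret@db-host:5432/appdb")
  conn.autocommit = True
  cur = conn.cursor()

  cur.execute("CREATE EXTENSION IF NOT EXISTS pg_ivm")

  # Incrementally maintained materialized view: pg_ivm keeps it up to date
  # as the base table changes, instead of requiring REFRESH MATERIALIZED VIEW.
  cur.execute(
      "SELECT create_immv('customer_order_totals', "
      "'SELECT customer_id, sum(amount) AS total FROM orders GROUP BY customer_id')"
  )

  cur.execute("SELECT * FROM customer_order_totals WHERE customer_id = %s", (42,))
  print(cur.fetchall())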


Anyone successfully using this? There are a few other services out there like PolyScale. It will be interesting to see if any of these introduce some form of write support over time.


I found one case study on their blog - https://blog.readyset.io/medical-joyworks-improves-page-load...

> It will be interesting to see if any of these introduce some form of write support over time

Writes performed by your application through Readyset are automatically proxied (redirected) to your database.


What do you mean by write support here? Readyset will apply writes via a replication stream from the upstream database.


I think they mean acting as a write-through cache. If you send writes to the cache and have the cache delegate those writes to the underlying DB you get consistent invalidation, and if the write-through cache is smart enough it can even be transactional with the underlying source of truth.
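
A bare-bones sketch of that write-through idea in Python (hypothetical cache and table names, and it glosses over what a real implementation has to do about failures between the two steps and about transactionality):

  import json
  import psycopg2
  import redis

  r = redis.Redis(host="cache-host", port=6379)
  pg = psycopg2.connect("postgresql://app:secret@db-host:5432/appdb")

  def write_through_update(post_id, title):
      # The write goes through the caching layer: commit to the source of
      # truth first, then refresh the cached copy, so a reader of the cache
      # never sees a value the database hasn't accepted.
      with pg.cursor() as cur:
          cur.execute("UPDATE posts SET title = %s WHERE id = %s", (title, post_id))
          cur.execute("SELECT id, title, body FROM posts WHERE id = %s", (post_id,))
          row = cur.fetchone()
      pg.commit()
      r.set(f"post:{post_id}", json.dumps(row), ex=300)
      return row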


Readyset does have some work and design done towards read-your-writes consistency, but it's still under development. The Noria paper does briefly touch on extending partial materialization with stronger consistency models via MVCC or other mechanisms as well.


What are some advantages to ReadySet versus read replicas from YugabyteDB or CockroachDB? A downside is that it appears to require a separate cloud subscription.


You can deploy on your own via their .deb packages - https://readyset.io/download

The advantage is that reading from a cache will be faster than from a read replica. The benefits increase even further if you have to perform computation on the fetched data.


> ReadySet is licensed under the BSL 1.1 license, converting to the open-source Apache 2.0 license after 4 years.

What does this mean?


Not the OP, but BSL is a source-available license that gives you access to use the source, with specific restrictions.

Most commonly the restrictions prevent you from launching a competing offering. In their case, you can't offer database-as-a-service using their code.

BSL typically also restricts production use - though it looks like ReadySet has relaxed that restriction.

Finally, BSL reverts to a traditional open source license after a set period of time - in their case Apache 2 after 4 years. This means that code written today is licensed under BSL for 4 years, then automatically reverts to Apache 2 thereafter.


The outdated, security-hole-ridden, useless code will eventually become Open Source.

In practice I would count on either using the software in compliance with the BSL restrictions (which are generous) or seeking a commercial license.



