Infrastructure SaaS – a control plane first architecture (thenile.dev)
80 points by infra_dev on June 22, 2022 | 31 comments

Hey - I don't want to be critical, but I think you should rethink your messaging.

I'll be honest with you: I thought this was a parody. It's SO abstract.

It's like you came from doing this abstract thing inside a big company & decided to do the same abstract thing as a startup. And describe it using the specific terminology used inside that specific team of SuperBigCo.

This is Ram, the author of the post. Thanks for the feedback. The architecture described in the post is something we built at a startup to provide our infrastructure as a service. It is true that this architecture gets complex in larger companies. I would love to understand which parts are abstract and how we can improve them. We plan to publish a series of posts to provide more clarity on the different parts mentioned in the blog. Your feedback would be really helpful.

Who is this for? What does it do? Why should I care? I ask all of these non-facetiously as someone who works in enterprise cloud architecture.

For recent projects I've been using PostgREST + postgresql-replicant + Kafka (or some durable message queue as appropriate).

I don't have to work too hard to implement the control plane as PostgREST takes care of generating the API and Postgres already has authorization controls built-in. Authentication is the part that requires a bit of toil to figure out. The rest is managing the schemas of the control plane entities. Basically, design the data and most of the rest is generated for me.

The data plane is trickier but I'm experimenting with using Postgres' streaming logical replication protocol to convert logical queries on the control plane data into business domain events that are forwarded onto the message queue. This is the part that uses the postgresql-replicant library I wrote but other libraries written in Python exist and can do the same thing.

This then enables me to implement business logic/data-plane actions asynchronously down-stream as isolated, stateful services that follow the event streams and react accordingly to various policies. They can then update the control plane models as they progress, which then could add more domain events to the stream, and so on. It's a bit like a functional-reactive architecture.
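To make that a bit more concrete, here's a rough Python sketch of the "row change to domain event" step. It's only an illustration under assumptions: the `RowChange` shape, the `clusters` table, and the event names are all made up, and it doesn't use postgresql-replicant or any real replication client. A real setup would decode changes from the logical replication stream and publish to Kafka instead of a list.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class RowChange:
    """One decoded row change from the replication stream (hypothetical shape)."""
    table: str
    op: str                 # "insert", "update", or "delete"
    old: Optional[dict]     # row before the change (None for inserts)
    new: Optional[dict]     # row after the change (None for deletes)


@dataclass(frozen=True)
class DomainEvent:
    name: str
    payload: dict


def to_domain_event(change: RowChange) -> Optional[DomainEvent]:
    """Map a low-level row change to a business event, or None if irrelevant."""
    if change.table != "clusters":
        return None
    if change.op == "insert":
        return DomainEvent(
            "ClusterRequested",
            {"id": change.new["id"], "region": change.new["region"]},
        )
    if change.op == "update" and change.old["status"] != change.new["status"]:
        return DomainEvent(
            "ClusterStatusChanged",
            {"id": change.new["id"], "status": change.new["status"]},
        )
    if change.op == "delete":
        return DomainEvent("ClusterDeleted", {"id": change.old["id"]})
    return None


def forward(changes: Iterable[RowChange],
            publish: Callable[[DomainEvent], None]) -> int:
    """Publish every change that maps to a domain event; return how many were sent."""
    sent = 0
    for change in changes:
        event = to_domain_event(change)
        if event is not None:
            publish(event)
            sent += 1
    return sent
```

Passing `publish` in as a callable keeps the mapping logic testable without a broker; in the sketch you can hand it `some_list.append` and inspect what would have gone onto the queue.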

I don't know if it's a production ready style architecture. Monitoring replication stream performance can be tricky and integration testing is challenging. And managing changes to the business domain is not a fully-solved problem: still a lot of exploration/tooling/and do-it-yourself duct-taping to do.

But it's simple enough that I can do a lot of work with very little code so far.

Hey, this is Ram, the author of the post. Would love to know how this architecture works in production for you. For Infrastructure SaaS, the control plane needs to manage hundreds of data planes. It also needs to own user and tenant management, security policies, usage-based billing, and usage and operational insights for your users. The challenge is in building an infrastructure that can centrally manage the lifecycle of all the metadata (SaaS, app and infra) and orchestrate with all the data planes. Postgres is definitely a good building block to build on top of, but in our experience there is so much to build around it to make this work well.

Managing the data plane is the more complex part of it as you would expect.

I do tend to model the business processes at a high level using domain-driven design and map the aggregates to services that ingest event streams. The services then react to the event streams in several ways: emitting events, updating control plane models, issuing new commands to other services, etc.

Each service keeps its own state internally and if I need to, I can blow away their state and replay all of the business domain events thanks to the durable message stream. That part is key... and is also the most duct-tape-and-toil area of this architecture.
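Roughly, the replay idea looks like the Python sketch below. This is a toy under assumptions: the log is a plain list standing in for the durable message stream, and `ClusterDirectory` and the event types are invented names. The point is only that state is a function of the log, so it can be thrown away and rebuilt.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterDirectory:
    """Toy stateful service: remembers which clusters exist and where."""
    checkpoint: int = 0                            # offset of the next event to read
    clusters: dict = field(default_factory=dict)   # cluster id -> region

    def apply(self, event: dict) -> None:
        """Update internal state from one domain event."""
        if event["type"] == "ClusterCreated":
            self.clusters[event["id"]] = event["region"]
        elif event["type"] == "ClusterDeleted":
            self.clusters.pop(event["id"], None)

    def catch_up(self, log: list) -> None:
        """Process everything from the checkpoint to the end of the log."""
        for offset in range(self.checkpoint, len(log)):
            self.apply(log[offset])
            self.checkpoint = offset + 1

    def reset(self) -> None:
        """Blow away local state; the next catch_up replays from offset 0."""
        self.checkpoint = 0
        self.clusters = {}
```

Calling `reset()` followed by `catch_up(log)` should land on the same state as before, as long as `apply` is deterministic. Real services also have to deal with event schema changes and side effects during replay, which is where most of the duct tape goes.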

I've been toying around with ideas to generalize this into a consolidated application framework but it's still pretty experimental stuff.

The high-level architecture isn't terribly novel or new, but standardized tooling for common operations (managing migrations of event schemas, managing checkpoints, etc.) is still a work in progress.

“Data plane” and “control plane” aren’t terms I’ve seen before, and I’m having trouble understanding what they are, even after reading the post.

Can you explain them in a more concrete, conversational way?

Does this service let me e.g. take any docker image and turn it into a SaaS, handling user accounts and billing etc?

These terms usually show up in the context of networking protocols. Cloudflare has a very quick explainer: https://www.cloudflare.com/learning/network-layer/what-is-th.... To make it even shorter: a control plane is where all the coordination that controls activity (data) happens. The data plane is where the data actually moves around.

Explainers seem to not cover _why_ you would want to separate these "planes". There are several reasons, and I'm no authority, but for starters:

* control messages will have different expectations around them: their amount and frequency, delivery guarantees, urgency with which they are processed. Treating this traffic separately means you can engineer appropriately for data and control traffic.

* last thing you want is the control message "stop processing traffic from IP x.x.x.x port y" to be stuck behind traffic from said IP/port...

In this context, the meaning is somewhat different. They are referring to administrative traffic vs "actual work" traffic. Auth, billing/accounting, configuration updates, that sort of thing. If you are running a SaaS, and your customer is very security conscious and wants none of their precious data to ever leave their VPC, you have 2 options: deploy your software into their VPC completely, making it hard to do a variety of things like upgrades, and increasing complexity; or you can separate control actions from your "worker nodes" and storage, and only deploy the latter into the VPC. You can then work on your control panels, monitor usage, continuously evolve various admin panels and config options, etc, using normal SaaS approaches while the security conscious customer knows that their core data is not leaving their virtual walls and only "bob ran a thing and stored results" goes to the vendor.

This post is about abstracting out common bits of how one implements that, and allowing SaaS offerings to provide that sort of separation easier.

Awesome explanation, thanks! (Particularly "last thing you want is the control message "stop processing traffic from IP x.x.x.x port y" to be stuck behind traffic from said IP/port..")

I should've added, there's an obvious example for the "SaaS control plane" separation, which is equivalent: "stop processing job X that is destabilizing the cluster" should be processed without needing to fight for resources with job X. Same for ACL changes, user deactivations, etc etc. It's generally a good idea to have your control stuff not be subject to whatever instabilities you might be controlling against.
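A tiny Python sketch of that idea, assuming a single worker with two in-memory queues (the `Worker` class and the message shapes are made up for illustration). Control messages get their own queue and are always drained before the next piece of data work, so a "cancel job X" never waits behind job X's own backlog.

```python
from collections import deque


class Worker:
    """Toy worker with separate control and data queues."""

    def __init__(self):
        self.control = deque()   # small, urgent messages: ("cancel", job_id)
        self.data = deque()      # bulk work items: (job_id, payload)
        self.cancelled = set()
        self.processed = []

    def step(self) -> bool:
        """Handle all pending control messages, then at most one data item.

        Returns False when there was no data item left to take.
        """
        while self.control:
            kind, job_id = self.control.popleft()
            if kind == "cancel":
                self.cancelled.add(job_id)
        if not self.data:
            return False
        job_id, payload = self.data.popleft()
        if job_id not in self.cancelled:
            self.processed.append((job_id, payload))
        return True
```

With a single shared queue, the cancel message would sit at the back and job X would finish its whole backlog first. In a real system the separation usually goes further than queues (separate connections, processes, or machines), but the ordering guarantee is the same.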

Hey ed, this is Ram, the author of the post. In the context of Infrastructure SaaS, the data plane is the infrastructure that you provide as a service. For example, let us say you are building a company that provides Postgres as a service. In this case, Postgres is your data plane. Typically, your users will want Postgres to be deployed in a specific region or cloud provider. They would then run queries against that Postgres cluster.

The control plane is the central lifecycle management system that provides the SaaS experience for your Infra SaaS application, manages the metadata for your application, and pushes this information to all the data planes. Examples of lifecycle management operations include creating a user or a new organization, provisioning your data plane in a specific region, deleting a cluster, etc.

The data plane is the product that you want to sell to your customers, and the control plane is the central system that makes your product work in a self-serve way for those customers.

This example can also be mapped to internal use cases. Many companies manage their own infrastructure internally and end up having to build a central control plane to manage all the different infrastructure that they provide as a service to their developers. Hope this helps.
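As a rough illustration of the split, here is a minimal Python sketch using the Postgres-as-a-service example. Everything in it is an assumption made for the example (the `ControlPlane` and `DataPlane` classes, the command format, in-process calls instead of a network); it is not how any particular product works.

```python
from dataclasses import dataclass, field


@dataclass
class DataPlane:
    """Stand-in for one regional deployment of the product (e.g. Postgres)."""
    region: str
    clusters: dict = field(default_factory=dict)  # cluster id -> owning org

    def apply(self, command: dict) -> None:
        # In a real system this command would arrive over the network.
        if command["op"] == "create_cluster":
            self.clusters[command["cluster_id"]] = command["org"]
        elif command["op"] == "delete_cluster":
            self.clusters.pop(command["cluster_id"], None)


class ControlPlane:
    """Central owner of orgs, users, and cluster metadata."""

    def __init__(self, data_planes):
        self.data_planes = {dp.region: dp for dp in data_planes}
        self.orgs = {}       # org name -> set of user emails
        self.clusters = {}   # cluster id -> (org, region)
        self._next_id = 1

    def create_org(self, org: str) -> None:
        self.orgs.setdefault(org, set())

    def create_user(self, org: str, email: str) -> None:
        self.orgs[org].add(email)

    def provision_cluster(self, org: str, region: str) -> str:
        if org not in self.orgs:
            raise KeyError(f"unknown org: {org}")
        plane = self.data_planes[region]   # fails early on an unknown region
        cluster_id = f"c{self._next_id}"
        self._next_id += 1
        self.clusters[cluster_id] = (org, region)   # record metadata centrally
        plane.apply({"op": "create_cluster", "cluster_id": cluster_id, "org": org})
        return cluster_id

    def delete_cluster(self, cluster_id: str) -> None:
        org, region = self.clusters.pop(cluster_id)
        self.data_planes[region].apply(
            {"op": "delete_cluster", "cluster_id": cluster_id}
        )
```

Users and orgs live only in the control plane, while each data plane only hears about the clusters it has to run. The queries themselves would go straight to the data plane and never pass through this code.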

Data plane is the data, what you traditionally think of as a service's inputs and outputs. E.g., a database server gets queries and returns rows/results. That's data plane. But when a system becomes large enough, the management stuff adds up to be notable in and of itself.

Pretend we're a SaaS company offering a database as a service. Adding and removing users, and the setting of passwords is control plane stuff. In a sufficiently web scale system, adding and removing users becomes, not just its own microservice, but a collection of microservices to authenticate and send updates to the main product database, and have its own separate database.

In the old days, we'd call this the "app" and the "management" (or "admin", or "provisioning") interface. The app is the application(s) providing the actual service your SaaS offers, the management interface manages/configures the app, handles migrations, updates, etc. Example: Maybe your app needs a separate DB per tenant. Your management thing handles spinning that up, etc.

I understood a data plane is for example the MySQL/pgsql/redis servers of all customers of your db as service

think old school ftp ports, 1 for data (your actual files being transferred) 1 for control (out of band messaging).

I work at a somewhat well-known unicorn in the data space that has been using this architecture for a while. In fact I'd wager that any (non-cloud-provider) company that provides any significant amount of compute or storage in their product offering will converge upon a layout that closely resembles this.

Overall I'd imagine there are a lot of parallels to other SaaS-ish architectures. One big divergence is that I'd consider the data plane to be a special kind of client (a client in the same way that a user's phone or browser is). The big difference is that we (the company) ALSO manage the lifecycle of this "client" (i.e. shutdown, startup, repair, update). Having an untrusted client that you also manage the lifecycle of can lead to some interesting design spaces.

Managing untrusted clients seems extremely challenging. I think these days you'd use something more "sandboxed" like WASM as the basis for the client?

I definitely agree with all the points here and the approach is correct, I've had the same thought before seeing project authors struggling to build cloud versions of their products.

But this raises some important questions. You're essentially outsourcing your entire company. Think about it: especially for an open source project, the main way you make money is to build a cloud version of your product. If someone else is doing that for you, what are you left doing? It's a dangerous place to be in.

What's to stop the owner of this service from realizing he can cut you out of the middle, and just build a service out of your product himself?

E.g. that's a lot of AWS's M.O. recently; they have really robust internal infrastructure to spin up managed services for any project, and they make a lot of money doing that.

Maintainers should be given the tools to monetise their projects using a cloud offering early on so they don't burn out on the way to learning what control plane/data plane nomenclature actually is!

Unfortunately there are limited tools/resources out there to answer the question "how can I build a cloud deployment option for my open source project" without 25 layers of abstractions. This is why open-source projects end up raising millions of dollars (e.g. Strapi, Appsmith, just off the top of my head). All this money just for all these companies to essentially build the same thing.

Ideally, there should be a service/tool (maybe thenile will be it) where I answer a few questions.

Containerised? Stateful? Long lived? Keep alive? Allow end users to deploy many instances? % premium per instance over cloud costs? License per instance? Pricing (tiered, volume, stairstep?) On end customer's own infra? Min resources per instance. Airgapped per user/airgapped per instance/all instances running on same cloud? Instance management API? Update strategy? Big Green button to update all instances at once?

And it should spit out a ready to deploy setup where I can start monetising my open-source project while concentrating on maintaining the project.
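Purely as a sketch of what the answers could look like as input to such a tool, here's a small Python example. Every field name and the pricing rule are hypothetical; they just mirror a few of the questions above (containerised, stateful, instance limits, % premium over cloud costs, minimum resources, update strategy).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeploymentSpec:
    """Answers to a few of the questions above (all field names are made up)."""
    containerised: bool
    stateful: bool
    max_instances_per_user: int
    premium_pct: float                 # % premium per instance over cloud costs
    min_cpu: float                     # minimum vCPUs per instance
    min_memory_mb: int
    update_strategy: str = "rolling"   # or "all_at_once" for the big green button


def instance_price(spec: DeploymentSpec, cloud_cost: float) -> float:
    """Price charged for one instance: raw cloud cost plus the premium."""
    return round(cloud_cost * (1 + spec.premium_pct / 100), 2)


def can_deploy(spec: DeploymentSpec, current_instances: int) -> bool:
    """Whether a user may start one more instance under this spec."""
    return current_instances < spec.max_instances_per_user
```

For example, with `premium_pct=30` an instance that costs 20.00 in raw cloud spend would be billed at 26.00. The hard part is obviously everything behind the spec, not the spec itself.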

If the above exists then congrats, you disrupted the main reason for open source projects raising millions of dollars.

Hey smashah, this is Ram, the author of the post. Everything that you said is pretty spot on. It is a really hard problem and pretty repetitive in many of the companies including open source projects. We hope to build a platform that can help with this and over time cover all the use cases you mentioned.

How do you think about partner versus build? Usage based billing on its own is a pretty complex topic as an example with multiple startups that have tried to do it right. It is pretty data intensive as usage frequency increases, as well. It seems like you may be trying to do too much here. Just one opinion but something to think through (build versus partner.)

I saw a pretty good article the other day about some of the difficulties on the build side. Might be worth taking a look: https://www.getoctane.io/blog/usage-based-billing-complexity

Specifically for APIs, it seems better to just partner and focus on building the core product and getting users. I've seen tools out there to manage usage-based billing and auth for APIs; archetype.dev is one. Seems like a no-brainer vs. spending eng hours in-house to figure it out. Thoughts?

Sounds insanely difficult to get right

YES! Imagine trying to do all that while maintaining and growing the source project. This is why I've never been able to get there with my project. Instead I have to make deployment buttons where DO makes all the revenue and their broken referral system doesn't even give me anything back.

One shortcut with all this is just to allow maintainers to surcharge a premium with deploy buttons.

I think [1] is a good definition of the architectural pattern (outside of networking applications). It’s the one we use within the crossplane community [2]

[1] https://aws.amazon.com/builders-library/avoiding-overload-in...

[2] https://crossplane.io

Why is this specific to infrastructure SaaS? Most of what you write applies to any B2B SaaS, doesn't it?

Great question! A typical B2B SaaS will have a multitenant deployment in one region. The control plane APIs and data plane APIs will run in the same region. Each new customer will create a logical tenant in your DB but there are no physical data planes created.

For Infrastructure SaaS, it is a bit different. You typically will have different customers provision your infrastructure in different clouds or regions depending on where they have their own infrastructure. This leads to many physical deployments spread across many regions and cloud providers. At the same time, you need to give the user a single pane of glass experience where they can manage all their infrastructure from a single dashboard. This requires a central control plane that is responsible for all the lifecycle management operations and that communicates all the metadata back and forth with all the data planes. Things like upgrades, observability, user and tenant management all need coordination with the data planes. This makes the Infrastructure SaaS use case a bit different from standard B2B SaaS. Hope that helps.

I'm also a bit confused by the terms used here. Is Replicated (https://www.replicated.com) an example of the kind of "control plane" you're referencing?

This is Ram, the author of the post. We have not unveiled our product yet. We are currently focussing on talking about the architecture that Infrastructure SaaS companies have to build and scale. The control plane and data plane are the critical parts of the architecture. We plan to post more blogs about different topics around these architecture patterns. I would love to chat 1:1 about what we are building if you are interested. My email is ram@thenile.dev

Terraform with a frontend? I honestly would like to know the elevator pitch here.
