Show HN: Serverless Postgres (github.com/kiwicopple)
159 points by kiwicopple 3 months ago | 61 comments
This is an MVP for Serverless Postgres.

1/ It uses Fly.io[0], which can automatically pause your database after all connections are released (and start it again when new connections join).

2/ It uses Oriole[1], a Postgres extension with experimental support for S3 / Decoupled Storage[2].

3/ It uses Tigris[3], a globally distributed, S3-compatible object storage service. Oriole will automatically back up the data to Tigris using background workers.

I wouldn't recommend using this in production, but I think it's in a good spot to provoke some discussion and ideas. You can get it running on your own machine with the steps provided, connecting to a remote Tigris bucket (it can also be an AWS S3 bucket).

[0] https://fly.io

[1] https://www.orioledb.com/

[2] Oriole experimental S3: https://www.orioledb.com/docs/usage/decoupled-storage

[3] Tigris: https://www.tigrisdata.com/
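For anyone who wants to try it, here is a rough sketch of what "run it locally against a Tigris/S3 bucket" looks like. The env var values, endpoint, and connection string below are my guesses rather than the repo's actual README steps; Tigris speaks the normal S3 API, so the standard AWS client variables apply:

    # Credentials for the bucket (Tigris keys, or an AWS key pair for a real S3 bucket)
    export AWS_ACCESS_KEY_ID=...
    export AWS_SECRET_ACCESS_KEY=...
    # Point the S3 client at Tigris instead of AWS (skip these two for an AWS bucket)
    export AWS_ENDPOINT_URL_S3=https://fly.storage.tigris.dev
    export AWS_REGION=auto
    # Bring up the dockerized Postgres + Oriole image (assuming the repo ships a compose file)
    docker compose up --build
    # ...then connect as usual (default credentials here are a guess)
    psql "postgres://postgres:postgres@localhost:5432/postgres"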




This seems more like "PostgreSQL using remote storage for the database" rather than "Serverless"?

It seems like it still needs memory/CPU resources locally (sized properly) to handle the returned data.

It also seems like an incredibly bad idea to use with any remote data store that charges for network traffic, i.e. AWS.

You could easily give yourself a many-thousand-dollar bill for the month if you start using that backend data store in a serious way. :( :( :(

---

That being said, it's a nifty idea conceptually. Just be aware of likely operational costs if you try using it. (!)


I guess the term “Serverless” is pretty ambiguous, but since the term originated with AWS Lambda (Firecracker), and Fly.io uses the same primitive, it seemed fitting. You can scale compute up/down and snapshot memory for fast startups.

The other tricky thing is data: there are several ways to handle this. Since Tigris has a lot of the data replication features that Postgres would traditionally handle, it looks promising. That said, it comes with the tradeoffs that you point out (transfer costs), but these are also costs that you’d pay for Postgres replication. I’ll try to benchmark some of these costs and compare them across several implementations. At this stage I just wanted to get an MVP working: the harder stuff is still ahead

Thanks for your comment - a lot of good points


> these are also costs that you’d pay for postgres replication.

Some places define a 2nd (private) network interface for their VMs, and you can transfer data between your VMs over that private interface without cost.

Not sure if AWS does it that way though.


> I guess the term “Serverless” is pretty ambiguous, (...)

I don't think this is the case though.

The core principle of serverless computing is that a cloud provider supplies a managed service and scales the computational resources allocated to that service to meet its needs.

This "serverless postgres" meets no part of that definition.

You mentioned AWS Lambdas. AWS allocates all computational resources when a function is invoked.

Let's face it: this was an attempt to abuse a buzzword.


Fly machines allocate all the compute resources when a TCP connection is established. I feel like this is the same?


I think you're being unduly harsh.

The term "serverless" is broad enough that it's use here is fine.


considering that this is an experimental hack, I don't think it's reasonable to get this hung up on semantics or to throw accusations. especially over a term as close to meaningless as this.

it scales to zero and manages its own compute resources; that's a lot closer than a lot of "serverless" products ever get


I use Fly.io, it’s not serverless at all, it’s a PaaS that manages your servers for you via its own automations and CLI. Fly also already supports Postgres (I use this too), so I’m not sure why I would use this? To me the whole point of Fly is minimal work :)


When you say it like that, nothing really is serverless. Everything is running on some computer.

Serverless architectures are provided by servers.


Serverless is like AWS Lambda, where you don’t have an explicitly running VM that you pay for by time; instead your code runs in “a cloud” and you pay per invocation.

This is nothing new! “Serverless” is well defined and understood for years and years!


This is actually what Fly machines do under the covers. When you use it as a PaaS, the client CLI sets all this up for you, and then our proxy starts/stops things as needed.
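For the curious, the relevant fly.toml bits look roughly like the sketch below (key names from memory of the current config format, so double-check the Fly docs for exact names and accepted values):

    [[services]]
      protocol = "tcp"
      internal_port = 5432          # Postgres listening inside the machine
      auto_stop_machines = true     # the proxy stops the machine once connections drain
      auto_start_machines = true    # ...and starts it again on the next inbound connection
      min_machines_running = 0      # allow true scale-to-zero

      [[services.ports]]
        port = 5432                 # external port the proxy listens on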


separating storage from compute is kind of the definition of "serverless postgres", right?


> separating storage from compute is kind of the definition of "serverless postgres", right?

Not really, unless you want "serverless" to be meaningless.

Think about it for a second: would it make any sense to call S3 a serverless anything? And NFS? And what about Gmail? Is Gmail serverless as well?


S3 is absolutely "serverless" storage in the same way Lambda is "serverless" compute: it's a service that abstracts away the underlying servers.

If the host is managing servers in a way that makes it impossible for me to even know how many servers are behind my usage of it, it's serverless.


> S3 is absolutely "serverless" storage (...)

I think this take is ridiculous, and it stretches the label to the point of absurdity.

Storing data on a third-party computer is not "serverless". Gmail is not a serverless messaging service. Flickr is not a serverless photo service. HN is not a serverless message board. Just because you store data on a third-party computer, that does not mean you're dealing with serverless anything.

https://en.wikipedia.org/wiki/Serverless_computing


We don't call Gmail, Flickr, and HN "serverless" because we're not trying to distinguish them from managing servers and because they're not infrastructure.

Wikipedia defines serverless as follows:

> "Serverless computing is a cloud computing execution model in which the cloud provider allocates machine resources on demand, taking care of the servers on behalf of their customers."

What part of that definition doesn't apply to S3? Amazon themselves[1] define S3 as serverless. Your linked article also says that CloudWatch and other AWS services are serverless as well.

I'm sorry(?) that serverless is a broader term than you realized. It was always a somewhat absurd label.

1. https://docs.aws.amazon.com/whitepapers/latest/serverless-mu...


We all get to pick which hills we want to die on but this is a pretty silly one IMHO.

You can fight the common usage of a word or you can accept it. Yes, of course there is still a server involved but not one the developer has to directly manage or even think about.

If you don’t manage the server directly AND you can scale to 0 (costs and compute, aka not be forced to pay for an instance for a managed service) then it’s “serverless”.


Not for most of the computing history where "serverless" was an actual word. But maybe some crowd is attempting to redefine it (yet again?). :)


what did it mean historically? in contemporary usage, it seems to be a synonym for "managed" (or more accurately, a meaningless sobriquet for PAAS that can be easily sold to the huddled masses of "full stack" boot camp grads)



> Citus is a bit out of date, now mostly native Azure, not flexible at all

> Neon has not released all of their Kubernetes operator, so it's not fully open source yet

> Crunchy is just a Kubernetes operator for HA Postgres, but they are working on a cool thing to natively use Iceberg as an FDW. But their attitude is that this serverless separation of compute and storage is not the right approach for the primary operational data store.


At very first glance this would be much closer to neon, with the separated storage and compute.

Crunchy Postgres for Kubernetes is great if you're running Postgres inside Kubernetes, but it's more standard Postgres than something serverless. Citus isn't really serverless at all either; it's more focused on performance scaling where things are very co-located and you're outgrowing the bounds of a single node.


Neon has a consensus cluster of >=3 WAL storage servers using local storage that is not scale-to-zero. Only with multitenancy can you amortize it to scale-down-to-not-much.

It's not clear to me that kiwicopple's work horizontally scales compute, ever. It seems it has to be just one server, and if you ran multiple they'd corrupt the S3 storage.

Neon: >= 3 nodes

kiwicopple/serverless-postgres: 0..1 nodes


I think the only relevant comparison here is Neon.


Novel, and maybe I am just old school, but I truly believe that a right-sized Postgres deployment is good enough for most use cases. I would trust my data to something vanilla barring a super niche scenario.


A vanilla Postgres or MySQL instance, when properly tuned, will take you EXTREMELY far. People forget how much latency network hops add compared to native disks.

Even improperly tuned, it’ll take you much further than one would be led to believe from the SaaS shills.


Neat stuff. AWS threw in the towel with their serverless PG (Aurora v2 can no longer scale to zero).


Their serverless Aurora was always terrible, even v1. It cost way too much and didn’t live up to the promise. I left for PlanetScale when they went to v2 (double the price, even worse features), before then moving to Neon (I still like PS, but it wasn’t a good fit for my business/tech model).


Can this run on AWS Lambda with an EBS block store?

In my head this is the cheapest option, without all those egress costs, but I'm not sure. Combine that with the fact that we don't like managing resources outside AWS unless absolutely forced.

I really wanted this because RDS serverless v1 could do zero compute, but v2 has a minimum compute requirement.

If I can run my Postgres reliably on AWS Lambda, the costs drop to near zero for non production instances, which is where a lot of people play.


Since this is dockerized, the best AWS primitive would be Fargate or ECS

As far as I know you can’t mount a file system like EBS to Lambda, and tbh even if you could, the implementation of Lambda+Postgres sounds particularly cursed.


Ah sorry I should've said EFS, got my acronyms wrong. That one is mountable to Lambda.

But Lambda can't accept multiple connections to the same instance, so you'd end up with multiple Postgres servers connecting to the same data source.
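For context, mounting EFS into a function is a supported Lambda feature (file system configs pointing at an EFS access point), so the storage side is at least possible; the sketch below uses placeholder names and ARNs, and it doesn't solve the multiple-servers-on-one-data-dir problem above:

    # Placeholder function name, subnet/security group IDs, and access point ARN;
    # an EFS mount also requires the function to run inside a VPC.
    aws lambda update-function-configuration \
      --function-name serverless-pg \
      --vpc-config SubnetIds=subnet-...,SecurityGroupIds=sg-... \
      --file-system-configs Arn=arn:aws:elasticfilesystem:us-east-1:123456789012:access-point/fsap-...,LocalMountPath=/mnt/pgdata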


Why EBS and not the decoupled storage that this does?


Ah sorry I should've said EFS, got my acronyms wrong.


Well

A. EFS is dogs*** for this

B. The EFS protocol is NFS, so that's the question.


Apparently EFS is better for storage latency (no personal experience): https://lumigo.io/blog/unlocking-more-serverless-use-cases-w...

But on the other hand EFS blocks Lambda's JVM fast-start mode (SnapStart), which is based on Firecracker VM snapshots. Not that it matters for Postgres...


For a moment I was excited to see sqlite with postgres compatibility.


Could you use this to create 100s of on-demand read replicas and get 100s of GB/s out of S3? Seems like a nice way to cheaply and quickly do batch jobs.



For $1,500 per TB per month? No thanks.


Serverless != serverless

You use a lot of servers, you just call them resources. Your S3 storage is a server, your background workers run on a server, and your Postgres, well, it doesn't run on a bike.

I hate this confabulating stuff. It's like people believing "the cloud" is an actual cloud. Couldn't we just leave this stuff to the marketing people?


Nevertheless, serverless is an actual concept where you don't deploy any server instances; you use only services. Those services might scale up with server instances, but you don't have to know about that.

I was against it too at first, but then I realized how to use it.


Cool idea, but it sounds dangerous. Data is often the most valuable thing for the company / project / app.


I am curious why this sounds dangerous. Data is still getting persisted to storage. It’s just a different architecture where compute and storage are not colocated on the same machine.


Instead of all this cloud-native stuff, I wish there was an embedded version of Postgres, à la SQLite or DuckDB



For what purpose? A lot of folks use Testcontainers to embed Postgres in tests.


Distribution purposes: a single binary is easier than a Docker container and its associated security issues.


Postgres is pretty heavy BECAUSE it's a network database. What would make you want embedded PG instead of sqlite?


When would it make sense to use such a setup? What kind of applications?


The “scale to zero” aspects mean that it’s useful for very infrequently accessed applications

Once I have experimented with Tigris’ global data replication, it could be useful for a distributed user base doing a lot of reads. The idea is that you can point a fresh Postgres instance at the S3 bucket for read access (rather than keeping a hot standby using Postgres native replication)


In addition to infrequent access type use-cases, the ability to trivially create new DBs without incurring significant costs is the killer feature.

This makes it feasible to build multi-tenant applications with 1-db-per-tenant. I know the same can be achieved with RLS policies or application-layer permissions, but nothing beats the peace of mind (for developer and customers alike) of having tenant data in entirely separate databases. This also makes it easier to have custom schemas for different subsets of users. Side note: SQLite (the OG serverless DB) also has the same benefit but is decidedly less featureful than Postgres/MySQL.

Another pattern that I have seen emerging is that certain products allow end-users to create DBs on the fly as a part of their own product offering. For example, Retool (which uses Neon) allows users to create a new database in seconds, even on their free plan. I don't think this would have been possible (both cost and DX wise) if they provisioned a new Postgres cluster for each customer.

On that note, and in light of what happened with the PlanetScale free plan, this would make the Supabase free offering and branching significantly less expensive to operate. btw: thank you for the free plan and all the amazing OSS contributions from the Supabase team!!


You have data updates that happen at irregular intervals and don't want to be bothered with turning a db instance on and off.


Why keep a separate Postgres instance for this use case?

One could easily have a single small Postgres instance for a few dollars that hosts all these small-scale, peanut-sized toy databases and forget about it.

If your main business-critical database is not doing transactions and stays idle, you should be working on bringing in more transactions (sales, sales, sales), not trying to save pennies by moving from a downsized SSD-fast Postgres to an S3-slow Postgres.

I get that there are benefits to scaling up/down, but to me: serverless is a billing model for multi-tenant services, where the user is billed by service usage, not the actual underlying resource consumption, because in a multi-tenant SaaS all your tenants share the underlying infrastructure.


I like to think this sort of project makes sense when putting screendoors on submarines.


Very cool! I’m glad to see someone working with oriole, been wanting to do something similar for a long while.

How tricky is it to work with the S3 storage? Esp. in regards to permissions, etc.?

Thanks for sharing eh, love what you’re doing


I think it’s pretty simple with S3, and the permissions are basically “keep the bucket private”

That said, this is very much an MVP and I’m sure I’ll encounter some dragons as I push for scale or as I receive feedback from customers


Very cool.


Now it would be mega cool if someone got a grant to do this with a fully open source stack. Fly uses Firecracker underneath, etc...


Oh fun, this is from the CEO of Supabase, so I guess that's where the grant is coming from :))))


Double funny, looks like they already acquihired the main devs from OrioleDB:

https://github.com/akorotkov (now Supabase)

https://github.com/pashkinelfe (now Supabase)

Silicon Valley, isn't capitalism awesome






