Skip the API, ship your database (fly.io)
301 points by danielskogly on Sept 13, 2023 | 250 comments



If you give access to your DB directly, your DB effectively becomes your API, with all the contract obligations of an API. Suddenly you don't completely control your schema: you can't freely change it, and you need to add things there for your clients only. I've seen it done multiple times and it always ends up poorly. You save some time now by removing the need to build an API, but later you end up spending much more time trying to decouple your internal representation from the schema you made public.


Absolutely correct, listen to this article's ideas with great scepticism!

The system that I'm currently responsible for made this exact decision. The database is the API, and all the consuming services dip directly into each other's data. This is all within one system with one organisation in charge, and it's an unmanageable mess. The pattern suggested here is exactly the same, but with each of the consuming services owned by different organisations, so it will only be worse.

Change in a software system is inevitable, and in order to safely manage change you need a level of abstraction between the inside of a domain and the outside, and a strictly defined API contract with the outside that you can version control.

Could you create this with a layer of stored procedures on top of database replicas as described here? Theoretically yes, but in practice no. In exactly the same way that you can theoretically service any car with only a set of mole-grips.


This is just an interface, and you have the same problems with versioning and compatibility as you do with any interface. There's no difference here between the schema/semantics of a table and the types/semantics of an API.

IME, data pipelines implement versioning with namespaces/schemas/versioned tables. Clients are then free to use whatever version they like. You then have the same policy of support/maintenance as you would for any software package or API.
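
For what it's worth, a minimal sketch of what that can look like in Postgres (the schema/table/column names here are made up):

    -- v1 stays frozen for existing clients
    CREATE SCHEMA api_v1;
    CREATE VIEW api_v1.orders AS
        SELECT id, customer_id, total_cents FROM orders;

    -- v2 reshapes the data without touching v1
    CREATE SCHEMA api_v2;
    CREATE VIEW api_v2.orders AS
        SELECT id, customer_id, total_cents / 100.0 AS total FROM orders;

Clients pick the schema they target, and old versions get deprecated on whatever support policy you'd use for a regular API.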


> There's no difference here between the schema/semantics of a table and the types/semantics of an API.

There is a big difference. The types of an API can be changed independently of your schema.


You're looking at the wrong layer. If we were to go to the layer you're talking about, we'd have internal and external tables where we could change the structure of the internal tables, and then rebuild/rematerialize the external tables/views from the internal ones.


If the external tables are views that can combine select columns from multiple tables with computed fields - maybe. In theory it’s good, in practice I’ve never seen it done well.


I do think tools to manage this stuff... basically don't exist, so I'm sympathetic to the argument that while there's mostly equivalency between data and software stacks, software stacks are way more on the rails than data stacks are. Which is to say, I have seen this stuff work well with experienced data engineers, but I think you need more experience to get the same success on the data side than you do on the software side.


Yeah, I could see that. It’s not common and the tooling is primitive. Same thing I would say about event sourcing. Great in theory, but it’s more likely to get your average team into trouble.


That’s the critical point - in theory this idea is fine.

In reality other ways of solving the same problem have a decade of industry knowledge, frameworks and tooling behind them.

Is the marginal gain from this approach being a slightly better conceptual match for a given problem than the “normal way” worth throwing away all of that and starting again for?

Definitely not in my opinion. You’ll need to spend so much effort on the tooling and lessons before you’re at the point where you can see that marginal gain appear.


> That’s the critical point - in theory this idea is fine.

I've worked on production systems where this kind of stuff worked very well. I think there's weirdly a big wall between software and data, which is a shame, because the data world has a lot to offer SWEs (I've certainly learned tons, anyway).

> In reality other ways of solving the same problem have a decade of industry knowledge, frameworks and tooling behind them.

It's pretty likely that any database you're working with is as old or older than any software stack. Java, PHP, and MySQL were all released in '95 (Java and MySQL on the very same day, which is wild), PostgreSQL was '96. Commercial DBs are even older: SQL Server is '89, Oracle is '79, DB2 and SQL itself are from the '70s. There's a rich history on the data side too.

> Is the marginal gain from this approach being a slightly better conceptual match for a given problem than the “normal way” worth throwing away all of that and starting again for?

The gain is pretty tremendous: you don't need an app server, or at least you only need a very thin one. Tech has probably spent billions of dollars building app servers over the last 30 years. They're hard to build and even harder to maintain. Frankly, I'm tired of stacking up huge piles of code just to transpile JSON/gRPC to SQL and back again.

> Definitely not in my opinion. You’ll need to spend so much effort on the tooling and lessons before you’re at the point where you can see that marginal gain appear.

There's a lot of tooling, it's generally just built into the DB itself. And a lot of software tools work great with DBs. You can store your schemas and query libraries in git. You can hook up your CI/CD pipeline right into your database.

I also can't recommend dbt enough [0]; it's basically the best on-ramp for SWEs into data engineering out there.

[0]: https://www.getdbt.com/
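
In case dbt is new to anyone: a model is just a SQL file that dbt builds into a view or table in your warehouse, with dependencies wired up through ref(). A tiny hypothetical model (file and source names invented):

    -- models/orders_v1.sql
    {{ config(materialized='view') }}

    select
        id,
        customer_id,
        total_cents / 100.0 as total
    from {{ ref('stg_orders') }}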


Versioned views, materialized views or procedures are the solution to this. Even internally, companies frequently don't give access to their raw data, but rather to a restricted schema containing a formatted subset of it.
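
Roughly, in Postgres terms (names invented), the restricted schema looks like this: consumers only ever get grants on the curated views, never on the raw tables.

    CREATE SCHEMA reporting;
    CREATE VIEW reporting.customers AS
        SELECT id, name, country, created_at  -- no email, no password_hash
        FROM customers;

    CREATE ROLE report_reader NOLOGIN;
    GRANT USAGE ON SCHEMA reporting TO report_reader;
    GRANT SELECT ON ALL TABLES IN SCHEMA reporting TO report_reader;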


Views will severely restrict the kinds of changes you might want to make in the future. For example, now you can't just move some data from your database into S3 or a REST service.

Stored procedures technically can do anything, I guess, but at that point you would be better with traditional services which will give you more flexibility.


A view can also do anything - it could query a REST service, for example. (Not saying that this is necessarily a good idea, though...)


Is that a real thing? What DBMSs support such views?


Most of the heavy artillery RDBMSes at least. E.g. Postgres lets you mount arbitrary HTTP resources as tables, which you can then put views over: https://wiki.postgresql.org/wiki/Foreign_data_wrappers
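
The HTTP wrappers on that wiki page are third-party extensions with their own options, but they all follow the same shape as the built-in postgres_fdw; a rough sketch with invented names:

    CREATE EXTENSION postgres_fdw;

    CREATE SERVER billing_srv FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'billing.internal', dbname 'billing');
    CREATE USER MAPPING FOR CURRENT_USER SERVER billing_srv
        OPTIONS (user 'readonly', password 'secret');

    CREATE FOREIGN TABLE remote_invoices (
        id bigint, customer_id bigint, total numeric
    ) SERVER billing_srv;

    -- and then put a view over it like any local table
    CREATE VIEW invoices AS
        SELECT id, customer_id, total FROM remote_invoices;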


This sounds like a total minefield. You might get a response in the same format, but I imagine it'd be very easy to break an API user who accidentally depends on performance characteristics (for example).


AWS RDS and Aurora both support synchronous and asynchronous lambda invocations from the database. Should be used very carefully, but when you want/need a fully event-driven architecture, it's wonderful.

https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Postg...
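
If I remember the docs right, it looks roughly like this (function name and region are placeholders; check the linked docs for the exact signature):

    CREATE EXTENSION IF NOT EXISTS aws_lambda CASCADE;

    -- asynchronous ("Event") invocation from inside the database
    SELECT * FROM aws_lambda.invoke(
        aws_commons.create_lambda_function_arn('my-function', 'us-east-1'),
        '{"order_id": 123}'::json,
        'Event');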


It is a _phenomenal_ waste of money in licensing fees, but MSSQL Server can embed C# DLLs and as a result run arbitrary code via its CLR integration.


Postgres can run arbitrary code too, but at this point it just makes more sense to create a service that acts as a database and translates SQL to whatever (people already do that). This, however, makes the whole game pointless, as we are back where we started.


Of course it’s possible, but now you need more people with DB and SQL knowledge.

Also, using views and stored procedures with source control is a pain.

Deploying these into prod is also much more cumbersome than just normal backend code.

Accessing a view will also be slower than accessing an “original” table since the view needs to be aggregated.


> Accessing a view will also be slower than accessing an “original” table since the view needs to be aggregated.

Where does it say anything needs aggregating? You can have a view that exists just for security.

> Also, using views and stored procedures with source control is a pain. Deploying these into prod is also much more cumbersome than just normal backend code.

Uh? This is normal backend code.


I don't see the problem here.

Are modern developers allergic to SQL or what is the issue?


Not all devs are proficient in SQL. It's another skill that is required.


In addition, if you are using Postgres, there is PostgREST to make an API really quickly and nicely.
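
e.g. you point PostgREST at a dedicated schema and every view in it becomes an endpoint, filtering included (names made up):

    CREATE SCHEMA api;
    CREATE VIEW api.products AS
        SELECT id, name, price FROM products WHERE NOT discontinued;

    -- PostgREST then serves, with no handler code written:
    --   GET /products?price=lt.100&order=name.asc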


why would you want to develop your api in sql over a traditional language?

versioned views and materialized views are essentially api endpoints in this context. just developed in sql instead of some sane language.


A lot of my backend career has essentially been Greenspun's tenth rule: any sufficiently complicated REST or GraphQL API contains an ad-hoc, informally-specified, bug-ridden, slow implementation of half of SQL.

SQL is a legendary language, it's powerful enough to build entire APIs out of (check out PostGraphile, PostgREST, and Hasura) but also somehow simple enough that non-technical business analysts can use it. It's definitely worth spending time on.


You can let your API's users do arbitrary queries on any sets of data you expose if your API exposes SQL views directly. Plenty of apps don't need anything beyond basic "get by id" and "search by name" endpoints, but plenty others do. At that point, with traditional backends, you're either reimplementing SQL in your API layer, or creating a new endpoint for each specific usecase.


> Versioned views, materialized views or procedures are the solution to this.

Wouldn't it be far simpler to just create a service providing access to those views with something like OData?


Whether it's method calls or database schema - isn't what really matters control over what's accessible, and the tools you have to support evolution?

So when you provide an API - you don't make all functions in your code available - just carefully selected ones.

If you use the DB schema as a contract you simply do the same - you don't let people access all functions - just the views/tables they need/you can support.

Just like APIs, databases have tools to allow you to evolve - for example, maintaining views that keep a contract while changing the underlying schema.

In the end - if your schema dramatically changes - in particular changes like 1:1 relation moving to a 1:many - it's pretty hard to stop that rippling throughout your entire stack - however many layers you have.


> Just like APIs, databases have tools to allow you to evolve - for example, maintaining views that keep a contract while changing the underlying schema.

What are the database tools for access logs, metrics on throughput, latency, tracing etc.? Not to mention other topics like A/B tests, shadow traffic, authorization, input validation, maintaining invariants across multiple rows or even tables...

Databases often either have no tools for this or they are not quite as good.


- Access logs: audit logging [0]

- Throughput/latency: pg_stat_statements [1] or Prometheus' exporter [2]

- A/B tests: aren't these frontend things? recording which version a user got is an INSERT

- Auth: row-level security [3] and session variables

- Tracing, shadow traffic: I don't think these are relevant in a "ship your database" setup.

- Validation: check constraints [4] and triggers [5] (rough sketch below the links)

Maybe by some measures they're "not quite as good", but on the other hand you get them for free with PostgreSQL. I've lost count of how many bad internal versions of this stuff I've built.

[0]: https://severalnines.com/blog/postgresql-audit-logging-best-...

[1]: https://www.postgresql.org/docs/current/pgstatstatements.htm...

[2]: https://grafana.com/oss/prometheus/exporters/postgres-export...

[3]: https://www.postgresql.org/docs/15/ddl-rowsecurity.html

[4]: https://www.postgresql.org/docs/15/ddl-constraints.html

[5]: https://www.postgresql.org/docs/15/plpgsql-trigger.html
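
On the validation point, a rough sketch of pushing invariants (including the cross-row kind mentioned above) into the database itself; the table/column names are made up and the exclusion constraint needs btree_gist:

    CREATE EXTENSION IF NOT EXISTS btree_gist;

    ALTER TABLE bookings
        ADD CONSTRAINT chk_dates CHECK (check_out > check_in);

    -- cross-row invariant: no overlapping bookings for the same room
    ALTER TABLE bookings
        ADD CONSTRAINT no_overlap EXCLUDE USING gist
        (room_id WITH =, daterange(check_in, check_out) WITH &&);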


Honestly, there's plenty of tools out there that can do the same thing.

The important crux of the counterpoint to this article is "if you ship your database, it's now the API" and everything that comes along with that.

All the problems you _think_ you're sidestepping by not building an API, you're actually just compounding further down the line when you need to do things to your database other than simply "adding columns to a table". :\

Edit: re-reading, the point I didn't make is that having your database be your API _is_ viable, so long as you actually treat it as an API instead of an internal data structure.


You can do impedance-matching code in a database, e.g. in stored procedures, but I think the experience is strictly worse than all the application-level tooling that's available.


Not sure what you mean.

Are you just talking about the expected shape of the data - the consumer of the database can do that either in SQL or at some later layer they control.

If you are talking about my 1:1 -> 1:N problem, I'd argue that can ripple all the way through to your UI (you now need to show a list where once it was a single value, etc.) - not something you can actually fix at the API level per se.

Bottom line, the more layers of indirection, the more opportunities you have to transform - but potentially also the more layers you do have to transform if the change is so big that you can't contain it.

Let's be clear - I'd typically favour APIs, especially if I don't control the other end. But I'm saying it's about the principles of surface area and evolvability, not really whether it's an API or SQL access.


I have spent my entire, long career fighting against someone who thought this was a good idea, unpicking systems where they implemented it or bypassing systems where this was implemented. It's a many-headed hydra that keeps recurring, but rarely have I seen it laid out as explicitly as this headline.


I guess that's what one gets for reading just the headline? TFA talks about the downsides called out in this thread explicitly.

tbf, the idea isn't that novel. Data warehouses, for instance, provide SQL as a direct API atop it.


I have, in fact, read the article, and they are _vastly underestimating_ the importance of those downsides. For instance, I once dealt with an issue that involved adding a column to a table, which they think shouldn't be too bad, that took two actual years to resolve because of all of the infrastructure built on top of it that bound directly to the table structure.


But surely the problem is with the infrastructure that can't deal with an extra column - not the db/table itself?

If all the users of an API were bound to the shape of the data returned, but you wanted to add an extra field, you'd have exactly the same problem surely?

Sounds like the problem was with too much magic in the layers above - as in the end the shape of the data returned from a query on a table is up to the client - you can control it directly with SQL - in fact dealing with an extra column or not is completely trivial in SQL.


I mean, exactly, that’s why this is a bad idea. Adding a column is simple, having your DB be your API is madness. The more magic you add, the worse it gets.


I think you are mixing two problems - having part of the DB exposed, and whether people then build brittle stuff on top.

My point is that if people build brittle stuff on top that's not a problem of the DB being accessible per se.

That could just as easily happen against an API.

I assume what you had was some problem with ORMs and automatically built data structures etc - I would argue that's a problem with those, not with the DB.


> If you give access to your DB directly, your DB effectively becomes your API, with all the contract obligations of an API. Suddenly you don't completely control your schema: you can't freely change it, and you need to add things there for your clients only. I've seen it done multiple times and it always ends up poorly.

In a past life, I worked for a large (non-Amazon) online retailer, and "shipping the DB" was a massive boat anchor the company had to drag around for a long time. They might still be dragging it, for all I know. So much tech and infra sprang up to work around this, but at some point everything came back to the same database with countless tables and columns whose purpose no one knew, but which couldn't be changed because it might break some random team's work.


That's [another reason] why you use stored procedures and only call them (rather than hardcoded or ORM-generated SQL queries) in your client app code.


I think this point is addressed in the article.


Came here to say this too.

From the article:

> A less obvious downside is that the contract for a database can be less strict than an API. One benefit to an API layer is that you can change the underlying database structure but still massage data to look the same to clients. When you’re shipping the raw database, that becomes more difficult. Fortunately, many database changes, such as adding columns to a table, are backwards compatible so clients don’t need to change their code. Database views are also a great way to reshape data so it stays consistent—even when the underlying tables change.

Neither solution is perfect (raw read replica vs API). Pros and Cons to both. Knowing when to use which comes down to one's needs.


This 100%.

My last customer used an ETL tool to orchestrate their data loads between applications, but the only out of the box solution was a DB-Reader.

Eventually, no system could be changed without breaking another system, and the central GIS system had to be gradually phased out. This also meant that everybody had to use Oracle databases, since this was the "best supported platform".


On the next iteration some consultancy will replace that with a bunch of microservices using a dynamic language.

When that thing fails again they will hopefully settle on a sane monolithic API.


Yeah this is my gripe with things like Firebase Realtime Database.

Don't get me wrong, the amount of time it saves is massive compared to rolling your own equivalent, but it doesn't take long before you've dug yourself a big hole that would conventionally be solved with a thin API layer.


Also, you shouldn't give direct access to your DB, for security reasons.

That's why APIs exist in the first place.


PostgreSQL 9.5 (7.5 years old) shipped row-level security [0] which solves this.

[0]: https://www.postgresql.org/docs/15/ddl-rowsecurity.html
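
A rough sketch of that, combined with a session variable for the tenant (names made up):

    ALTER TABLE accounts ENABLE ROW LEVEL SECURITY;

    CREATE POLICY tenant_isolation ON accounts
        USING (tenant_id = current_setting('app.tenant_id')::bigint);

    -- each client connection identifies itself once
    SET app.tenant_id = '42';

Policies don't apply to the table owner or superusers, so clients have to connect as a plain role for this to do anything.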


The architecture described in the article replicates the SQLite database on the page level.


Yeah but this thread became about "you need an API".


Technically you can create different users with very precise access permissions. Might not be a good idea to provide that kind of API to the general public, but if your clients are trustworthy, it might work.


No clients are trustworthy.


You could ship the database together with python/JS/whatever 'client library' - and you tell your clients that they need to use your code if they want to be supported.


You just know they're going to run custom code, fck up their database and then still complain.

I'm not too familiar with DBs, but I know customers. They're going to present custom views to your client SDK. They're going to mirror your read-only DB into their own and implement stuff there. They're going to depend on every kind of implementation detail of your DB's specific version ("It worked with the last version and YOU broke it!"). They're going to run the slowest joins you've ever seen just to get data that belongs together anyway and that you would have written a performant resolver for.

Oh, and of course, you will need 30 client libraries. Python, Java, Swift, C++, JavaScript and 6+ versions each. Compare that to "hit our CRUD REST API with a JSON object, simply send the Authorization Bearer ey token and you're fine."


This is the worst of both worlds. Not only are you back to square one, as you spent the time to build an API (client libraries), but now, if the API is limiting, the users will find ways of accessing the SQLite db directly.


Are you assuming clients will actually upgrade the library on a regular basis?


You can use stored procedures if you want to add another abstraction layer.
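
i.e. expose a function instead of the tables. A sketch with invented names, using SECURITY DEFINER so callers don't need SELECT on the underlying table:

    CREATE FUNCTION get_customer_orders(p_customer_id bigint)
    RETURNS TABLE (order_id bigint, placed_at timestamptz, total numeric)
    LANGUAGE sql STABLE SECURITY DEFINER AS $$
        SELECT o.id, o.placed_at, o.total_cents / 100.0
        FROM orders o
        WHERE o.customer_id = p_customer_id;
    $$;

    REVOKE ALL ON orders FROM client_role;
    GRANT EXECUTE ON FUNCTION get_customer_orders(bigint) TO client_role;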


They had stored procedures in the "old days" when they figured out that direct access to the database was a bad idea, so what has changed? (I agree that a DB view often is good enough though, but they ALSO had that in the "old days", IDK what has changed about that :-p )


yeah reminds me of meteor JS


The title reads like it came from an MBA biz-bro that doesn't want to do anything properly because it wastes time and costs money. FWIW, I skimmed the article.

Building an API for a new application is a pretty simple undertaking and gives you an abstraction layer between your data model and API consumers. Building a suite of tests against that API that run continuously with merges to a develop/test environment will help ensure quality. Why would anyone advise to just blatantly skip out on solid application design principles? (clicks probably)


The guy knows what he's talking about [0].

> Building an API for a new application is a pretty simple undertaking

This is super untrue, backend engineering is pretty hard and complicated, and there aren't enough people to do it. And this is coming from someone who thinks it should be replaced with SaaS stuff like Hasura and not a manual process anymore.

> Building a suite of tests against that API that run continuously with merges to a develop/test environment will help ensure quality.

You can test your data pipelines too; we do at my job and it's a lot easier than managing thousands of lines of PyTest (or whatever) tests.

> Why would anyone advise to just blatantly skip out on solid application design principles?

Because building an API takes a lot of time and money, and maintaining it takes even more. It would be cool if we didn't have to do it.

[0]: https://github.com/benbjohnson


I've been doing this for a long time, and all I can say after reading this multiple times is... "I don't get it".

I mean, I get it, from a technical standpoint. Ok, so you're going to send read-only Sqlite databases to everybody.

Am I missing what the API (that you still need) is updating when you insert or update something and all client DBs are now stale? Is there a central database? How often are you pushing out read-only database replicas across the wire to all clients? Is that really less "chatty"? If so, how much bandwidth is that saving to push an entire database multiplied by the number of clients?

None of this seems logical. Maybe I'm missing the real-world use-case. Are we discussing tiny Sqlite databases that are essentially static? Because in the last 30 years I've not run into a situation where I needed to have clients execute SQL queries on tiny, static databases let alone still need to potentially update them also.


Author here. We're using LiteFS to replicate SQLite databases in real time so the changes sent are only incremental. I think there are several ideal use cases for sharing databases across applications:

1. Internal tooling: you're able to manage the contract better since you have control of the source & destination applications.

2. Reporting & analytics: these tend to need a lot of query flexibility & they tend to use a lot of resources. Offloading the query computation to the client makes it easier on the source application.

As for database size, the Corrosion program mentioned in the post is about 8GB and has a continuous write load so this doesn't have to just be for tiny databases.


Looks interesting although personally I don't see those as compelling use cases although I may very well be missing something.

> 1. Internal tooling: you're able to manage the contract better since you have control of the source & destination applications.

This has not been my experience in anything but tiny companies or companies with very strict mono-repo processes. One big point of separate teams is to minimize the communication overhead as your organization grows (otherwise it scales as N factorial). That means you do not want a lots of inter-department dependencies due to the internal tooling and APIs they leverage.

> 2. Reporting & analytics: these tend to need a lot of query flexibility & they tend to use a lot of resources. Offloading the query computation to the client makes it easier on the source application.

Depends on the resources the client has versus the server to devote to a single query. The resources are also high because of how much data is analyzed, and 8GB seems tiny to me (i.e. the whole thing can be kept in some DBs' memory).


Ben, fan of your work. You guys have really moved the flag on sqlite.

Are there any plans for Corrosion to be published as OSS?


hey Ryan, thanks! As for Corrosion, I can't say anything publicly but something may be announced in the near future wink wink :)


Curious if you've tested it with Jepsen. Anytime I come across some distributed system, my stomach ulcers start playing up while I wonder about all the weird failure modes. I kinda looked for a caveats page on the LiteFS web site, didn't really see one.


LiteFS doesn't try to be a correct distributed system, as you can see from: https://fly.io/docs/litefs/how-it-works/#cluster-management-...

Basically, the solution they have is:

1. There is a single writer. There's optional best-effort leader election for the writer.

2. If there's a network partition, split-brain, etc, availability is chosen over consistency.

Jepsen's testing is focused on databases that pick "consistency". Since LiteFS didn't pick consistency, there's really not any point in running Jepsen against it. Like, Jepsen would immediately find "you can lose acknowledged writes", and LiteFS would say "Yes! That's working exactly as intended!"

However, another way of running LiteFS is with only a single writer ever (as in one app, one server, one sqlite database only), and all clients as read-only replicas that are not eligible for taking writes ever. In that case, you also don't have a proper distributed system, just read only replicas, which is quite easy to get right, and mostly what this post is talking about.


LiteFS works similarly to async replication you'd find in Postgres or MySQL so it doesn't try to be as strict as something running a distributed consensus protocol like Raft. The guarantees for async replication are fairly loose so I'm not sure Jepsen testing would be useful for that per se.

On the LiteFS Cloud side, it currently does streaming backups so it has similar guarantees but we are expanding its feature set and I could see running Jepsen testing on that in the future. We worked with Kyle Kingsbury in the past on some distributed systems challenges[1] and he was awesome to work with. Would definitely love to engage with him again.

[1]: https://fly.io/dist-sys/


I could imagine this technique being useful for kepler.gl or other data visualization tools


Do you realize most reporting & analytics use cases don't use SQLite databases?


Do you realise that there are many reporting and analytics cases where SQLite is a great fit?


I am looking at the data analytics industry as a whole and being involved in communities of data practitioners. Most of these people use cloud DBs (Snowflake, BigQuery & co) since there are fewer dependencies on DB-admin type of work.

Some might be using Postgres or whatever the Engineering team provided them with, but I don't think I have really heard of a data person preferring to use SQLite.

Might still be a great fit, but as the other comment pointed out, it might not be a good fit for the target audience.


The parent is saying this isn't a great idea because target users don't use SQLite. Your reply is that it's a good idea if people changed how they do things to fit the idea.

I don’t have a horse in this race the reply isn’t very well I don’t know what to make of that


I have more interest in this: https://electric-sql.com/

They are adding a sync-mechanism for Sqlite so local writes can be synced to a remote location and on into Postgres and so on. CRDT-based eventual consistency.

So local write latency, eventually consistent sync and the tools for partial sync and user access are the primary areas of development.

Early days but an interesting approach to shipping the db.


Oh, I implemented something like that for my Android app. It seems to work quite well. I don't have many users yet, though.

I replicate the client's SQLite DB to a central server (over a REST API) where it is synced with existing data. I use a CRDT, so changes don't get lost, and as long as the clocks of all the users' devices are accurate, the merges are in the correct order.

This enables offline access, but also the use of my app without an account. You can always merge the data later. Multi-device merge is also possible, if you create an account. Especially the multi-device merge was a big headache until I settled for this approach.

Since it is a home-grown solution I still have to do some manual stuff that could be abstracted away, like converting datatypes to and from JSON, Kotlin, PHP, MySQL. There's not always an equivalent datatype in those technologies.

This approach probably won't work well for big databases that share data between multiple users, though.


Are there any advantages you noticed as a dev to using this stack? Did you find there were better affordances? Were there challenges and growing pains?


It probably makes most sense if you have a middle-tier-less application where the UI IS the application, and effectively it just dumps the whole application state out through APIs anyway. We have ended up with this scenario playing out and you eventually just throw up your hands and make the UI hit the server with a giant "send me a huge blob of JSON that is most of the tables in the database on first page load" query anyway.

So the assumption that "if anybody changes anything everybody needs to know it" is close to true. In that scenario, putting a bunch of APIs in the way just makes the data less coherent rather than more. In most other scenarios, yeah, it's hard to see that it really makes sense.


Never in my career spanning decades have I had to ask a server to send me a dataset so large that it is "most tables in the database on first page load".

In what use case do you run into that?

I'm lead and an architect on an enterprise application at the moment that drives the whole company. It's your standard configuration, front-end, APIs, SQL. The system requests what it needs to fulfill only what functionality the user is dealing with.

Earlier in my career I was dealing with large enterprise desktop applications that talked directly to the database, with no centralized API. Some of them had thousands of individual desktop clients hitting a single multi-tenant SQL Server. No problem, SQL Server would handle it without breaking a sweat. The bandwidth to an individual client was fine. It was fast. And that was 20 years ago.


I did face this scenario a couple of times when working on applications that would work offline. A few of those I was involved in:

- trying to reduce the amount of paper-based catalogs we were sending out to customers; those were a couple thousand pages, not cheap to produce, not cheap to send, and would get outdated very quickly. The web app would pull the entire catalog at first load so customers could go to remote locations and still be able to use the catalog

- a web app for sales that was intended to be used on customer sites, containing all the marketing materials and much more, for use during presentations on site without ever having to connect anywhere


> In what use case do you run into that?

Single-tenant simple sites without permission checks with limited content volume.

Of course you can also do the same in a multi-tenant environment but naturally you wouldn’t be returning all rows in the database.


This is pretty much what many SPAs were/are: dump the entire relevant app state on first load, and then offline is no problem.


Imagine you're building a SaaS which allows your users to create a website, hotel booking platform and channel manager.

Users can open the application, get the database, and the frontend does all the offline updates the user wants to perform. The user can update their website, add bookings they received on the phone, check what prices they set. This is all blazingly fast with no loading time, no matter your internet connection, because no further communication with the server is needed.

At some point they press a magic Publish button and the database gets synced with upstream and uploaded. The upstream service can take the data and publish a website for its users, update availabilities, update prices, etc.

It would be a better user experience than 99% of the SaaS out there.


Most of the Internet spent the last 10-20 years moving away from this because business metrics generally benefit from realtime updates and background autosaves. By and large, users expect systems to autosave their work-in-progress nowadays.

You might remember horror stories from the late 1900s when people would lose critical data and documents because they "forgot to hit Save and the computer crashed." Or, maybe you've never experienced this — because a couple decades of distributed-systems engineers and UX researchers made that problem obsolete.

So now we've... reinvented the Save button, but somehow needed a Paxos implementation to do it?

Everything old is new again, I guess.


Heh, the last company I worked for had a (very popular, very successful) 20 year old desktop app that worked by downloading the client's entire database on login, then doing regular syncs with the back end and sending batch updates as the user made changes. It was an awful system that everyone hated, and the company was desperately trying to figure out how to move away from it without doing a complete rewrite. Maybe I should let them know that they're actually ahead of the curve if they can just hold out for a few more years...


I feel like I'm taking crazy pills. As HNers, we should want more control, not less over our work. Auto-save means that the company, no matter how nefarious, decides on their terms when my work is saved - regardless of what I've entered into the document (I write documents in a very stream of conscious manner).

I want to decide when to save, and how, not the application. I am not a product, I am a user!!!


Seems like on-by-default auto save with the option to disable, and a separate save button, satisfy all your requirements. That’s a pretty common set of customizations on saves from what I’ve seen.


This makes me smile, because it's satire, right?


No. I value a meaningful "Last updated at" timestamp.

If I open a word document, redact a few bits, save to PDF and close it - I don't want my changes saved but they are (real scenario from last week. I'd only just moved the file into Onedrive so autosave had turned itself on)

Ditto sorting and filtering spreadsheets.

Software engineers: By all means save in the background so I never lose anything, but don't assume I want the latest changes.


What you want, then, is versioning of your documents. MS Office and Google Suite do that.

This is not trivial to implement, though. Complexity will depend a lot on the application and the data. I'd say it'll only make sense for mature apps. A startup most likely won't consider this in its roadmap.


Hell, even with word processors we have had auto-save enabled by default for at least a decade.


> User can open the application, get the database, the frontend does all the offline updates the user want to perform.

"get the database"

How small do you think these databases are?!

You're going to download the entire hotel booking platform's database?

For how many hotels? One at a time, and then get another database? Or are you getting a SQLite booking database for every hotel in the world? And you're going to send them to each user? For what date range?

And even if that were possible, you then have to commit your offline updates. What if someone else booked the date range prior to you? Now your Sqlite copy is stale. Download the entire thing again? There could have been countless changes from other users in the time since you last got your copy.

This explanation just leaves me even more confused. It's illogical.


I have some experience with a comparable platform. In a typical CouchDB/PouchDB design, syncs and offline work are common and easy, and it's pretty close to database-based design if you get fancy with views and javascript-based logic.

For this project, I'd do:

* A database for each user (this is natural on CouchDB, one can enable a setting to create the user DB automatically when a user is created. The platform won't choke on many databases) - for some per-user data, like comments, if the user has already reserved a booking we can mirror booking data there + some info regarding the hotel.

* Common synced databases - some general info regarding the platform and hotels. Preferably not too large (IIRC there's no limit to attachments with 3.x, but it sounds risky). Anything too large we'd do online or use data provided with the application.

* A large booking database which isn't synced and must be modified online with a REST call - we don't have to allow every action offline. Here I wouldn't entirely dispense with API. This obviously needs the online component for making sure bookings don't conflict. Could even be a regular relational database.

I think it is possible to implement this the CouchDB database-way: a provisional offline reservation to the user database followed by some rejection method on the server when syncing, but I don't think there's much value to the user here. This design however would allow us to not sync all the data but a much smaller portion while supporting offline.

---

It sounds very doable, rather similar to a project I was involved with, but I miss SQL a lot with that platform (javascript NoSQL not my favorite to work with). A sqlite-based approach is an interesting alternative.


Web applications aren't going to get bigger just by themselves! /s

Jokes aside, this extreme optimization for development does have impacts on user experience. The amount of bandwidth/storage etc. used by "just ship the whole db" type applications would surely suck for most people outside of the 'I measure my bandwidth in Gbps' crowd?


Not the entire database, just your own data for your own hotel.

It doesn't sound that unreasonable.

Even if you have 3-4 hotels your data won't be significant.


You did not read the use case. It's not an equivalent of Booking. More like an equivalent of Wix for hotels.


Thanks for the example, it helped me understand the idea.

The thing that I'm unclear on is how do I figure out what data to ship to the user? Like if I don't already have a database-per-user then I have to extract only the data the user has permission to see, and ship that as a database?

That would be the case even if I had database-per-customer - not every user is necessarily able to see all the data owned by the organization they're a part of.

It seems like a lot of extra work, and error-prone, too (what could be worse than accidentally shipping the wrong data _as an operational database_ to a client?)

Edit: the article covers this at the bottom, but IMO it's a show-stopper. How many applications of any real complexity can actually implement database-per-user (database-per-tenant is probably not enough, as I mentioned above). As soon as you need any kind of management or permissions functionality in your app then you can't ship the database anymore, so you may as well start off not shipping it.


Once upon a time, this is how applications worked and this was a pretty good experience.

Now, you'll introduce a huge amount of user frustration around why their changes never made it to the central database. If you have users closing a form, especially if it's on a webpage, they are going to expect the write to happen at the same time the form closes. Making a bunch of changes and batching them in a single sync is going to be confusing to a ton of users.


What happens if two users want to make changes at the same time? What if their changes conflict? How do you even detect conflicts?


It could be done with "select for update" and etags+triggers checking the etag received + triggers generating a new etag on every create/update.
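
Sketch of the etag + trigger part, assuming Postgres and made-up table names (gen_random_uuid() is built in from 13, pgcrypto before that):

    ALTER TABLE bookings ADD COLUMN etag uuid NOT NULL DEFAULT gen_random_uuid();

    CREATE FUNCTION bump_etag() RETURNS trigger LANGUAGE plpgsql AS $$
    BEGIN
        NEW.etag := gen_random_uuid();
        RETURN NEW;
    END;
    $$;

    CREATE TRIGGER bookings_bump_etag
        BEFORE UPDATE ON bookings
        FOR EACH ROW EXECUTE FUNCTION bump_etag();

    -- writers send back the etag they last read; zero rows updated means someone else won
    UPDATE bookings SET status = 'confirmed'
    WHERE id = $1 AND etag = $2;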


If everyone's working on their own local copy of the database then the select for update isn't going to do anything. The issue is later syncing and detecting conflicts. It's actually easier to do this with a centralized DB, hence why everything works this way. If an app is mostly offline then, yeah, ship it with a SQLite DB and life will be good, but for something like a hotel booking app that doesn't actually solve many problems and makes existing ones harder.


Wasn't this Lotus Notes?


What is this? A scifi novel?

By the time your dude downloads the database half the hotels on your imaginary website have changed status.


A lot of SPAs already operate this way by leveraging things like localStorage in the user's browser.


To do what? Isn't this limited to 5 MB?

I just checked a heavy user for our enterprise SPA. localStorage is using ~12 KB.

I don't know what people are just throwing in localStorage but they certainly aren't pulling down much of the database (which still would have to be checked to ensure cached data in there isn't stale).


> Because in the last 30 years I've not run into a situation where I needed to have clients execute SQL queries on tiny, static databases let alone still need to potentially update them also.

It was not uncommon in the early 2000's for applications targeting mobile professionals and sales types to work this way, except that instead of SQLite they used things like Notes¹, Jet² and MSDE³. By modern standards these would be considered "tiny mostly static databases", often not exceeding a gigabyte. Instead of connecting over the internet they used serial dial-up file transfer protocols to synchronize with a central database or a mainframe. People would typically download information in the morning, make updates offline and synchronize at the end of the day. A slow trickle of periodic bidirectional updates ensured everyone had a reasonably fresh copy of the information they needed.

1- https://en.wikipedia.org/wiki/HCL_Domino

2- https://en.wikipedia.org/wiki/Access_Database_Engine

3- https://en.wikipedia.org/wiki/MSDE


I think the point is basically "unless there's a good reason for your API to look different than your DB schema, ontologically speaking, then the schema has already effectively defined your API and just let people interact with the DB", or, alternatively, "bake in as much of the data constraints as possible into your DB schema, and only when you can't enforce them with column constraints or DB extensions should you add APIs on top".


I have seen some stuff where a streaming tool (Kafka or whatever) is used to just ship all updates to a database to certain clients. But I think this is a dubious architecture, since it comes with basically all the downsides of a database that many applications share, apart from the write contention one.


Supabase realtime handles this really elegantly


Is this a true cloud managed SQLite or is this like their PostgreSQL documentation where it's just a bunch of pre configured docker containers and the developer is expected to manage everything themselves? If the db goes down for an extended period of time at 3am, does fly.io have an engineer on call?

I find that fly.io has been a very disingenuous startup. They position themselves as some sort of Heroku successor and hide behind their developer content marketing while providing very low quality infrastructure. At the end of the day, fancy blog posts may win hearts and minds, but downtime costs businesses actual money.


I think you are thinking that Fly is like a serverless platform. They aren't. They are the opposite. They are a server platform. They provide server for you and you have to manage your server yourself.

Nothing they provide is managed by them. You have to do that.

LiteFS is just a replication service for your sqlite database so you can keep your database synced across multiple nodes.

https://github.com/superfly/litefs

LiteFS Cloud, which is a service they provide, just helps you back up and recover SQLite databases. You can do this yourself.

https://fly.io/docs/litefs/backup/


I go into a project folder and can run "fly launch" and then the application is alive in multiple regions around the world. Where exactly am I managing this server?


Open up the Dockerfile they generate for you.


How does a Dockerfile change things? Your application is packaged into a container. The Fly service runs any application not just web applications or build artifacts. Of course your application is going to get a complete operating system environment to run in.

I don't think the Dockerfile has anything to do with them pulling your container from their registry and running it on their server.


Fly.io does not run containers. They convert your docker image to a VM.


Ah right, okay, thanks. The point still stands: the Dockerfile and image are used as a container for your application and do not imply a manage-it-yourself server is being used.


So a scaffolding tool?


There is no such thing as "serverless", that's just a higher than normal level of vendor lying.


I said nearly the same thing decades ago. Smashed a few friends’ so-called “wireless” telephones and showed them all the wires. Just look out a window; what’re those long, wiry things running between the utility poles, huh?! It’s ridiculous, the level of lying.


I think it's "lying" if you mislead another person.

In those two examples: serverless and wireless phones, it's not misleading consumers.

Serverless is a way to communicate that the consumer won't need to provision and maintain servers. It is true.

Wireless phone means you can move for quite long distances while talking without any wire connecting your device to the wall. It is true.

You may argue they're "misnomers". But to call it a "lie" is quite a stretch...



The first sentence in your link includes "in which the cloud provider allocates machine resources on demand, taking care of the servers".

There is NO SUCH THING as serverless. There are always servers.


Dear God this is a tired point. The rest of the damn paragraph of the Wikipedia article fully addresses that. Nobody who uses the term "serverless" thinks that magic fairies are replacing servers. The point is who is managing it for your application or business.


Serverless doesn't mean servers don't exist, any more than calling a radio a "wireless" doesn't mean wires don't exist. It just means for the consumer, servers aren't a consideration, just as with a wireless wires aren't.


The irony is that many things that are advertised as wireless actually need wires, and things that aren't, don't.

My wireless printer needs a wire for power.

My dog doesn't need a wire for anything, but no-one talks about them being 'wireless'.

(Come to think of it, my dog is serverless too...)


> My dog doesn't need a wire for anything, but no-one talks about them being 'wireless'.

Indeed :-) But isn't this the same sort of definition you want for "serverless"? Not something relevant to consumers (e.g. "you don't manage servers") but a description of something where servers are not involved in any way?


> you have to manage your server yourself

Can you give an example? You mean backups, OS patches, firewall config, etc etc? Are they like Hetzner?


Author here. I think we could have set better expectations with our Postgres docs. It wasn't meant to be a managed service but rather some tooling to help streamline setting up a database and replicas. I'm sorry about the troubles you've had and that it's come off as us being disingenuous. We blog about things that we're working on and find interesting. It's not meant to say that we've figured everything out, but rather that this is what we've tried.

As for this post, it's not managed SQLite but rather an open source project called LiteFS [1]. You can run it anywhere that runs Linux. We use it in a few places in our infrastructure and found that sharing the underlying database for internal tooling was really helpful for that use case.

[1]: https://github.com/superfly/litefs


Can second this. Tried them out at our startup, came away unimpressed. Their good engineering blog is great marketing though, although this particular post is lacklustre. What about permissions, indirections, versioning or otherwise encapsulating breaking changes?


I was just setting up a new Fly account and launching a project yesterday. I will say their developer content marketing is very successful and had me convinced they at least know what they are talking about.

Any more stories or knowledge of them being disingenuous?


I wouldn't say disingenuous but uptime has been disastrous.

I'm definitely not buying anything there.

I run a free service on fly.io.

It's been ok lately (or maybe my users just stopped complaining), so maybe they got better.


But it's free.


If the free juice they give you in the supermarket tastes bad, will you buy it?

We assume the free version to have similar quality to the paid one. It's in the vendor interest, so that we like it and become a paid customer. If they can't convince us of their quality in the free version, how are we supposed to trust the paid one?


Free can be the most expensive price of all.


Their DX is good. They just have a lot of reliability/availability issues.


A very interesting idea to be sure, but IME the biggest downside (which tbf is mentioned in the article) is the contract. If you have clients with knowledge of and dependency on the schema, you can't change it in a breaking way unless you update all the client's code.

I've tried various patterns in the past like one that just exposes database columns as an API, and this pain point always comes calling and it hurts. Keeping your data model simple and lean as possible is an important part of limiting complexity, which directly correlates with maintainability.

The only pattern/approach that I consistently return to is the Rails pattern (I use Elixir/Phoenix now but same pattern). It is certainly not the most sexy, and having a client be able to graphql exactly what they need can be really helpful, but at least for me it has rarely turned out to be worth the tradeoffs.


This is basically how Firebase is designed and it drives me nuts. I've freelanced with several teams, primarily comprised of front-end devs, who decided to use Firebase for their product. When they first told me they query the database directly from both their website and their mobile app, I immediately was like "so...what happens if you need to change the structure of the database"?

Crickets.


Crickets, really? It's very obvious what to do, as recommended by Google - you use a Cloud Function (aka Cloud Functions for Firebase) to do a "migration", while simultaneously handling both cases in client code until this is done.

You must have been working with a bunch of amateurs if such an obvious solution (even without reading the documentation) apparently didn't come to mind for you or your clients (since you're the freelancer). This particular problem isn't even Firebase specific - you have to do this for any store that doesn't strictly have a schema.

This issue you're describing isn't even in the top 5 biggest issues about using Firebase.


As I stated in my comment, I was working with front end developers, who had little-to-no experience working with backends and hadn’t considered migrations. That documentation really didn’t help because it’s such a laborious and error-prone process, especially when dealing with mobile applications that live on users devices. My implied point was that an API layer resolves the need for any of that bullshit to begin with.

And yeah, there are tons of other issues with Firebase, but this was the first warning siren that went off in my mind. I hate Firebase.


The solution to this is the same as with APIs: versioning. Instead of naming your table "my_foo", you name it "my_foo_v1". Then, when you want to make a breaking change to the schema, you:

1. Create a new table "my_foo_v2" with your desired schema

2. Modify write queries for "my_foo_v1" so that they also write to "my_foo_v2"

3. Copy over existing data in "my_foo_v1" to "my_foo_v2" with a migration script

4. Modify read queries for "my_foo_v1" so that they read from "my_foo_v2" instead

5. Remove all write queries to "my_foo_v1"

6. Drop the "my_foo_v1" table


Holy hell does that sound awful. In practice, it doesn't even solve the core problem, as you'd have to deploy new versions of all binaries precisely in sync to avoid incorrect data because some clients know about the v2 table and others don't. It becomes indistinguishable whether a row is missing because it was intentionally deleted or because it was written by a client who didn't know about the new table version. There are ways to account for this, but it's layers of crap on top of crap.


> it doesn't even solve the core problem, as you'd have to deploy new versions of all binaries precisely in sync to avoid incorrect data because some clients know about the v2 table and others don't

it's painful, but this approach does work as long as every client is migrated between each step.

after 1: v1 is valid, reading v1, writing v1

after 2: v1 is valid, reading v1, writing v1 and v2

after 3: v1 and v2 are valid, reading v1, writing v1 and v2

after 4: v1 and v2 are valid, reading v2, writing v1 and v2

after 5: v2 is valid, reading v2, writing v2

after 6: v2 is valid, reading v2, writing v2

at each point, both the current and the previous version are reading a valid table and writing to a table whose values will make it into v2.


Good luck migrating every single client in the real world (unless you are willing to disable old clients & force an upgrade if they get too old)


These are the same problems you have with any API or package you ship.


Is there really any alternative? Data migrations are always going to be terrible. You need to slowly roll out the incremental changes so that everything is backwards compatible until you reach the threshold point where you can finally delete the historical route.


With an API, you can often create facades and other patterns to avoid having to migrate the underlying data


You can often also accomplish this with a database view.


In a sensible approach schema migrations are totally decoupled from the API.


With postgres, simple views are updatable, so you can often do this pattern without copying tables or dual writes. It's particularly useful for renaming columns, but you can also do some other things. You also don't have to use version numbers on all your tables since you only need the view temporarily.

    -- swap the real table out and hide the change behind a compatibility view
    ALTER TABLE my_foo RENAME TO my_foo_tmp;
    ALTER TABLE my_foo_tmp <your desired change here>;
    CREATE VIEW my_foo AS SELECT <how you make your new table seem like your old table>;
    COMMIT;
    <update your apps to use my_foo_tmp>
    -- once nothing reads the view any more, give the table its old name back
    DROP VIEW my_foo;
    ALTER TABLE my_foo_tmp RENAME TO my_foo;
    CREATE VIEW my_foo_tmp AS SELECT * FROM my_foo;
    COMMIT;
    <update your apps to use my_foo>
    -- finally drop the temporary alias
    DROP VIEW my_foo_tmp; COMMIT;


We actually used a similar tactic when transitioning our data to supporting soft deletion. We have a limited number of queries that update data or present data for administrative review and processing - but a lot of wild ones around reporting. We'd rename table `foo` to `foowithdeleted` and create a new `foo` view that excludes any soft-deleted rows. Our reporting queries keep on trucking like normal and read out of the `foo` table (now a view - but they don't care) and we only need to adjust the administrative view to show soft-deleted rows for undeletion operations.

Shell-gaming tables with views can be incredibly useful in postgres.
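
For reference, the shape of the trick (the deleted_at column is my guess at how the soft delete is tracked):

    ALTER TABLE foo RENAME TO foowithdeleted;

    -- reporting queries keep selecting from "foo", unaware it's now a view
    CREATE VIEW foo AS
        SELECT * FROM foowithdeleted
        WHERE deleted_at IS NULL;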


You know, I've been toying with this very idea in my head for a while, as a way to make safe schema migrations. But I also know very little about databases so had no idea if this was a stupid idea or not. Glad to know I'm not alone.

One issue I've considered is fk relationships - that could get complicated depending on the approach.


Schema migrations are a solved problem. You make backward-incompatible changes in multiple steps.

https://engineering.edx.org/django-migration-donts-f4588fd11...


Imagine you have Android & iOS apps which some of your users won't ever update. You'll never be able to drop any old version's tables, and you'll need to keep all table versions in sync more or less in real-time. Only when the last user has updated from a certain client version will you be able to remove that version's tables.


I'm sorry, that is insane. That approach wouldn't pass the first sniff test of any sensible technical design review.


You'd be surprised how many companies do exactly this stuff and not just with DB schema changes. They're called bridging changes.


I think about this with PostgREST/Supabase, which has a similar problem (autogenerated code based on the db schema is inherently going to conflict with API stability guarantees). I think that this approach is just fundamentally at odds with making a stable, version-controlled API.

However: I think the best thing to do if you want a setup like this and want to have your cake and eat it too is something like this:

Use your autogenned client for your “internal” API. This is for clients with the autogenned schema that you directly control, so that you can ship changes and not have to worry about versioning/backwards compatibility.

Then for external users that have their own clients, you have a slimmed-down, more traditional API that offers less functionality but is properly versioned, tested, etc.

I think this kind of hybrid setup can work well for SaaS products where you have a cloud product that does internal stuff plus external surfaces that end users need to operate on. You get the benefit of being able to iterate quickly without breaking your clients, and since your external API is smaller it's less maintenance overhead to keep it updated and versioned.


PostgREST docs have always recommended using only views + functions in a dedicated schema for the API: https://postgrest.org/en/v10.2/schema_structure.html
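Roughly this shape (a sketch; the `api`/`internal` schemas, table and role names are made up):

    -- internal tables live elsewhere; PostgREST is pointed only at "api"
    CREATE SCHEMA api;
    CREATE VIEW api.todos AS
      SELECT id, title, done FROM internal.todos;
    -- anything that isn't a plain CRUD shape becomes a function
    CREATE FUNCTION api.complete_todo(todo_id bigint) RETURNS void
      LANGUAGE sql AS $$
        UPDATE internal.todos SET done = true WHERE id = todo_id;
      $$;
    GRANT USAGE ON SCHEMA api TO web_anon;
    GRANT SELECT ON api.todos TO web_anon;

That way only what's in `api` is exposed, and the internal schema can churn behind it.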


Huh, interesting, since at least with Supabase the default setup kind of pushes you towards, or at least doesn't discourage you from, using tables instead


They mention one partial "solution" to this: have your clients query a view, instead of tables directly.

When you need to change the schema, also update the view accordingly.

Of course, there are still limitations. For one: you can't "massage" data with a view as you could in the backend with a full featured programming language.

PS: I don't think this is a good solution, just mentioning it.


If you are in the situation where you are shipping a web app where one dev team controls the backend and the one and only client and ships both together, is this a big issue?


I think there are some reasonable ways to avoid the schema brittleness that people are concerned about with this.

Firstly, acknowledge that in this model your schema IS your API. Don't include anything in the database that is being replicated that isn't considered to be documented, tested and stable.

With that in place, API changes become the same as changes to a JSON API or similar: you can only make backwards-compatible changes, like adding new columns or adding new tables.

If you do need to make a backwards incompatible change there are ways to patch over it. Ship a brand new table with the new design, and include a SQL view that matches the name of the old table and exposes a compatible subset of it.
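Sketching that with made-up names, after migrating the data into the redesigned table and dropping (or renaming) the old one:

    -- the new design
    CREATE TABLE orders_v2 (
      id          INTEGER PRIMARY KEY,
      customer_id INTEGER NOT NULL,
      total_cents INTEGER NOT NULL
    );
    -- old consumers keep querying "orders" and see the old shape
    CREATE VIEW orders AS
      SELECT id, customer_id, total_cents / 100.0 AS total
      FROM orders_v2;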


Maybe it's better to have views in the first place, and to require (though it can only be enforced at an administrative level) external services to query against those views.

We won't have to do that later switch to a view; we've paid that cost upfront.

We can then include a version number in the view name, similar to how we do with APIs. So if we introduce changes we can keep services which didn't need that change outside the blast radius.

And we have a limited ability to enforce contracts and declare some columns internal.
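Something along these lines (hypothetical names):

    -- services pin to a versioned view, never the underlying table
    CREATE VIEW customers_v1 AS
      SELECT id, name, email FROM customers;
    -- a later change gets a new version; v1 stays up until its last consumer moves
    CREATE VIEW customers_v2 AS
      SELECT id, name, email, country FROM customers;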


Even better, use stored procedures / routines on top of views and only query the stored procedures from the frontend.
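i.e. something like this (sketch, Postgres, hypothetical names), so the frontend only ever calls functions:

    CREATE FUNCTION api.orders_for_customer(cust_id bigint)
      RETURNS SETOF orders_v1
      LANGUAGE sql STABLE AS $$
        SELECT * FROM orders_v1 WHERE customer_id = cust_id;
      $$;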


I like it, but the only benefit we've retained at that point is eliminating round trips. Which is pretty significant. But we're still engaging in API design, we're just doing it in Sqlite.


That is an appealing idea but on second thought, haven’t you just shifted the problem to: how do I reliably replicate logical updates from my operational tables (internal, constantly changing) to my clean, public-facing tables?

Basically, your public tables become materialized views over the “dirty” tables, and I’ve never met a mature, reliable implementation of materialized views, or even an immature one that is feature complete. (I would love to be wrong on the internet with this one!)
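For reference, what I mean in stock Postgres terms (sketch, hypothetical names). The refresh is manual and rebuilds the whole thing, which is exactly the feature gap:

    CREATE MATERIALIZED VIEW public_api.orders AS
      SELECT id, status, total_cents FROM internal.orders;
    -- has to be re-run (or scheduled) yourself; no incremental maintenance built in
    REFRESH MATERIALIZED VIEW public_api.orders;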


That's just working around an artificially and unnecessarily introduced constraint. There's little to no upside to doing it this way. It's not even more convenient, since it incurs pretty heavy refactor costs for even small changes.


This is correct, but there are very few people with experience with safe database modelling like this, and they may not even realize the technique exists.


The same is true for APIs, too. Making changes to a JSON API that won't break existing clients isn't necessarily an obvious skill.

But it can be learned, as can backwards-compatible database schemas.


I have been thinking about this a lot lately. We have customers who run automation against us. They’re doing 10,000 API calls. Each one has to do a permission check, request signing, serialization, etc., etc. All just to mutate a blob of data less than say, 100MB. If they just downloaded the whole state, mutated it, and reuploaded it, the whole thing would be done in 2 seconds.

We already lock the entirety of each customer’s data when processing a request, in order to avoid data inconsistency from concurrent mutations.

One SQLite database per customer is a really appealing idea.


I'd think 100x before deploying this pattern to external customers.

The use case shared by fly.io is for internal tooling. If you have a small team that most likely won't grow to more than a few dozen devs, maybe you can get away with it.

Shipping it to several external customers will be a horrible nightmare in maintenance and dev productivity. I'd switch jobs if I was working in a project that started applying this. Paying for the HTTP API upfront is a cheap price to avoid the costs of not having it down the road.


> If they just downloaded the whole state, mutated it, and reuploaded it, the whole thing would be done in 2 seconds.

Do you think your customers would accept this as a test? They would run against an in-memory database if they wanted to.


A downside I didn't see mentioned (it was gestured at with contracts and the mention of backwards compatible schema changes, but not addressed directly) was tight coupling. When you link services with APIs, the downstream changes of a schema migration end at the API boundary. If you are connecting services directly at the database level, the changes will propagate into different services.

That would make me very nervous to ship something like this. You can probably cover a lot of this with views as a backwards compatibility shim, though.


Author here. I don't see this as a general practice to be used for most applications. It was a side effect that we came across, and it allowed us to share data between internal applications in some interesting ways. Internal apps are ideal since you have control of both sides of the contract. It definitely requires some more consideration if you're publishing out to end users.


Hi Ben, thanks for taking the time to respond. It's an interesting approach, and I'm sure that it has its place & you've made the right call for your particular situation.

I didn't think you meant pushing it out to end users, my point was more that this technique increases the blast radius. If three services are touching this database, and our changes now necessitate three deploys, that's much higher risk.

But it was an early comment, I hadn't gotten to see the suggestions for managing this coupling yet. Each one of those does take us a step closer to designing an API, but perhaps part of the value here is that it allows you to move along a spectrum and to adopt the level of constraint you can afford, given the situation.

I'm curious if this has spread to any other services at fly? Or is it just this Corrosion service?


I think a spectrum is a great way to look at it. If you're adding one service then it's no big deal but as you add more and more then you get a better sense of the requirements of the API and I think you're in a better place to build one out then.

We've used this in a few other instances but Corrosion is the largest database we've done it with.


I love this kind of reasoning that forces you to climb out of the comfortable nook in a tree of thought you’ve been occupying for years to see if there’s a better vantage point!

So would this work equally well? Use Postgres on some cloud and “spin up” a read-replica that your clients have access to. Let them read from the replica.

If yes, I have tried that and I can tell you the problem I had. They don’t want to get to know my database schema. I know, frustrating! “If you would just get to know me better, you would have all the secrets you desire!” Sigh… they just want this overly simplistic Interface that doesn’t change, even when we rewrite things to make them better on the backend.

Another thought: if sharing compute is the main problem, you can just egress the data to distributed file storage like S3 and let them write SQL on the files using a Presto-based engine like Athena. Or egress to BigQuery and let them bring their own compute. But the problem there again is getting to know my internal data structure, and also by “egress” I really mean “dump the whole damn thing”, because once the data is on S3 I can't just update a single row like I can on an OLTP database, and you can only do that every so often.


Yes, Postgres replication would work as well. I agree that the "getting to know the schema" part is an issue. I think there's use cases out there where you have power users that would gladly invest extra time in exchange for query flexibility.

Querying against S3 is a good approach too. Those query tools can be somewhat involved to set up but definitely a good option.


It sounds like this is essentially database replication, but perhaps cheaper, since it's a SQLite file with a schema hopefully designed to contain whatever the client needs rather than random junk. You'd still have to send writes to one location and they'd fan out from there, with some latency.

It seems like it would be a good fit with edge servers, assuming any node that starts an instance could get a copy of the database when it starts. Most database-as-a-service companies don't do that.

I doubt I'd try it if it isn't a fully managed system, though. The only one I know of is Cloudflare's Durable Objects, and it's not in the free tier. For hobbyist programming that's kind of a drag, so I haven't tried it. Instead I'll probably try Deno's KV. (This sort of thing is what I hoped KV would be, but apparently it isn't.)

I wonder what the cold-start performance would be for downloading a small SQLite database from S3, starting it up in memory, and subscribing to a replication log?


I've written about this solution in the context of running LLM-generated SQL safely. I call it the "cloned database view" solution[1].

The author touches on two issues that are tangential to this problem: restricting unauthorized data, and mutations. Neither is really easily solvable with the cloned database solution.

However, the underlying theme of skipping the API is extremely compelling. So much code, both FE and BE, are just to serve as an elaborate interface to the database. Simplifying this will be a huge gain in software maintenance. But the challenge is to do it safely.

1. https://docs.heimdallm.ai/en/main/blog/posts/safe-sql-execut...


The pattern here involves database replicas, not clones.

It's not for a single-shot test, but for continued use with data synchronization.


From the post:

>A cloned database view is some representation of your database that has been post-processed to only contain data that the user is allowed to see. This representation would typically be in the form of a separate user-specific sqlite database file. Because the cloned database only contains data that is theirs, a user may issue any query against it, and there is no risk to data security. And if a malicious user was able to issue a DDL statement, it would only affect their cloned database.

Maybe there is a better word than "cloned database view." Replica doesn't feel quite right though.
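One way to build such a per-user file, as a rough sketch (SQLite, hypothetical tables and user id):

    -- run server-side against the primary database
    ATTACH DATABASE 'clone_user_42.db' AS clone;
    CREATE TABLE clone.orders   AS SELECT * FROM orders   WHERE user_id = 42;
    CREATE TABLE clone.invoices AS SELECT * FROM invoices WHERE user_id = 42;
    DETACH DATABASE clone;
    -- hand clone_user_42.db to the user; any query they run only touches their rows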


I think your use of database clone looks fine in the context you write about. I don't think it applies to the context fly.io is writing about. Your post is about something that seems like a different pattern, I think.


In the beginning of my career, I did an internship at a very small company that had been selling a desktop app for 20+ years, among other stuff. Each customer got their own database.

The software was pretty stable and the team were part time consultants for that app, which mostly involved helping users perform complicated queries and creating stored procedures (they loved those). Subscription was allegedly quite expensive and had minimal churn over the years.

For that kind of hyper stable software where each customer get their own database, I can see it working. In your typical 2023 SaaS where you are serving 1000s of users through a complex API and a database, I'm not sure a db can give you the level of flexibility and quality of life an API gives you.


The lifecycle of a startup full of young arrogant naïve inexperienced souls:

1. Why is everyone doing all these complicated things?! We stopped doing them! We're much smarter than everyone!

2. Few years later.

3. At great financial, technical and business cost we redesigned our system to do all those complicated things, because our initial version was an unmaintainable spaghetti monster that wrapped up everything its tentacles touched and hugged it until it died.

There are situations when exposing SQL is the right call. I'd say those are pretty narrow, say when you have colleagues in a partner company doing integration. SQL is not a way to avoid having an API. SQL is an API. The surface of this API is gigantic, and it also binds your customers to a particular implementation of your database (the SQL vendor you happen to use). If those customers are a few partners, this is easier to change. When it's the public at large... you're effed. This is all pretty basic systems analysis, but kids these days don't know the first thing about it.

But here's a basic rule of thumb: exposing SQL as your API may be the right call if you've spoken in person with every customer who will be using that API, and especially if this in-person contact is regular, say at least once a month. If you haven't, then SQL is not a good API for that customer.


This is 100% the pattern. However most startups (and most software projects if we think about it) fail. They don't make it to year 2.

This built-in failure rate shouldn't be a reason against doing things right the first time, but it might explain why so many don't get to learn the lessons of year 2. Most of the time we will never get to redesign and realise the mistakes we made.


Sounds like the next step is some way to define a multi-table view over a more traditional RDBMS (e.g. Postgres), serving as a source of truth, that can be continually exported as a read-only SQLite, so that the SQLite file can be pushed to the edge and read from the edge?

I'm not sold on the idea of per-customer databases; businesses still need analytics, so if you're not running your analytics on the centralized DB itself, your app needs to write/push the data somewhere else, which is an unnecessary complication early in the product lifecycle at best and a recipe for frustration at worst.


Remember the times when software companies shipped software to users, and users used the software offline? No one from the business even dreamed of knowing what each user was doing with the software at any moment in real time, what data users stored, etc. Today it looks like this is a MUST-have for any software and no business is possible without it.


> No one from business even dreamed of knowing what each user was doing with the software at any moment in real time

Surely you jest. Not knowing what your users are doing with the software was the number one problem with product development in the 90's. The only solution was user interviews and real-time in-person observation, which was highly expensive, thus making software much more expensive.

> Today it looks like this is MUST have for any software and no bussines is possible without this

Because more and more software is delivered over the Internet through low-touch sales methods. More often than not, software producers don't even know who their users are - how do you conduct user interviews, expensive as they may be, if you don't even know who your users are?

Trying to run a modern business without a modern analytics stack is functionally equivalent to driving with a blindfold on.


Just want to temper this a little bit.

Yes, information can be better than no information, but user analytics is not the holy grail it pretends to be and can be worse than no information at all.

I led a team at a BigCo for some years that had access to mountains of highly detailed user data, which enabled product, eng and UX folks to ask very specific questions about how successful various changes were in terms of our business and product goals.

An outrageous amount of time was invested in this process: designing, describing and negotiating experiments, and collecting data over days or weeks. In almost every case the data was basically inconclusive, and every question would spawn more questions. Enormous time was spent digging and sifting through this data, while even people with the best of intentions would almost always disagree about what the data was telling us.

Eventually, some quant showed up and explained how exactly zero experiments we ever did were statistically significant, and that if we wanted to achieve significance, we would need to run a 1% experiment on our many millions of users for more than 30 days.

The organization collectively shrugged at this information and went right back to the time worn rituals of success theater.

What I learned in that experience was that developing great products requires experience, skill, intuition, grit and visibility, but that visibility might be the least critical of those requirements, not the most.

Put another way, there is almost always a simpler, clearer measure of success that doesn't require spying on everyone all the time and then sifting through what is actually worse than useless information.


Geez, yeah. I was working on a mid-sized company's ecommerce site. And even with the number of users we had hitting our site daily, in order to get to any type of statistical significance to assure us that we weren't wasting our time, we had to wait about 21 days. That limited our cycle time.

My boss ended up telling me that I was moving too slowly, and I needed to move on with the experiments. I was surprised, since he should be smart enough to understand statistical significance, but here, he turned a blind eye to it.

A/B testing is only the way to go when you can gather millions of samples. For web sites, this is more likely at big companies.


No need to get data at that level of granularity. "How many users do we have?" "Who is our biggest user?" Simple questions like those are simple to answer against a centralized database, and much more difficult to answer when the databases are distributed.


To me, this looks a lot like Sequin (https://sequin.io), minus the ability to write back in the other direction (but plus open source!).


I like the idea of exposing the database and pushing more business into the database.. but it's terrifying.

Row level security and exposing postgres to the world sounds cool, but I can't see myself doing it.

It's like... encryption is awesome in theory and protects your data, but would you push your password-store to a public github repo? In theory you could...

edit:

A quick Google-fu and it looks like some people do push their password-store to Github... balls of steel.

https://github.com/jysperm/passwords


This looks interesting for read-heavy distributed apps, e.g. fly.io / AWS Lambdas / k8s.

Each instance gets its own eventually consistent, read-only copy of the DB. So instead of making a network request to go get some data, you query your local cache.

Your use case has to be read-heavy for this to fit your needs.

If you have a copy of the data local to you, you don’t need to go bug another team to expose it via an API.

Reading fly's blog posts, it'd be good to see more on write durability. From what I've read, writes are batched on the client and batched on fly's side before being compacted to S3, so you can lose writes before your client has sent them to fly, and possibly lose them before fly has written them to S3, depending on how it's implemented.

FlyCloud might make a great service for small, typical WordPress-type sites or low-user services where writes are low. It's a shame they don't offer pay-as-you-go for the LiteFS product only. I get why they don't, as they want you to host with them, but it'd be great if you could pay as you go on the DB only.

I currently use DynamoDB for low-volume stuff for cost reasons; it costs me cents. Productivity is extremely low with Dynamo though, as you need to think very hard about table design and constraints and work around those constraints. It's a lot more effort than using something like SQLite for quick, simple stuff.


You can use LiteFS Cloud without hosting your app on fly.io - https://fly.io/docs/litefs/getting-started-docker/#litefs-cl...


As someone who has worked as a data engineer on an organically grown monolith, the last thing you want to do is cede control over your schema for lack of a (sensible) abstraction. Everyone will change it, nobody cares about standardization. It's utter chaos and a data engineering nightmare.

This is the same, plus the lack of proper authN/authZ. Good luck passing your next audit.


A very interesting idea. At Yazz we don't do replication, but if someone creates a report in the tool and exports it as HTML then we actually export the entire app DB into SQLite embedded in the HTML page. Any changes made to that embedded SQLite DB don't get replicated to the original database, so it's kind of like a read-only copy, only suitable for reports.


There's also a productivity boost with this approach.

When you have a local DB as an integration point, you could just upload it to ChatGPT Code Interpreter for building extra reports and running analysis on it.

Alternatively, upload the schema to ChatGPT and “write these few methods with unit tests for me, please”.

I’m not saying that this approach is universal, but it saves me an hour of time now and then.


From the perspective of someone who's been neck-deep in a highly complex PostGraphile project for several years now, I wholeheartedly and strongly recommend that you SHOULD NOT expose your database or autogenerate an API from your database. Maybe it's fine if you have a really basic schema and the API requirements are just a few CRUD endpoints with no business logic, but as soon as you go past that you'll find yourself in a world of pain.

One upsetting side effect of giving API consumers the ability to query your data arbitrarily is that they will absolutely find the most inefficient and resource-intensive ways to do it. And the blame when the "API" is taking several minutes to come back with data will be laid squarely at your feet, even though the cause of the problem may not be under your control at all.


Kudos to the author(s) for building LiteFS; it really does seem like an interesting project. However, I can't see this being a popular pattern moving forward (or at least I hope that's not the case).

Replacing APIs with DBs in real production workloads? Really? I don't have much else to say. Also, the value proposition is not quite clear to me. Using replication to have local, read-only replicas of a SQLite db might admittedly be beneficial for high read-throughput workloads, but I'd imagine that the DB might have serious size/read-throughput constraints due to disk space and network overhead from replication.

My takeaways are: a) This doesn't sound like a reliable pattern for building distributed services. b) It's bad PR for fly to publicly push for the adoption of this pattern. c) It's even worse PR that fly is using this internally.


Replacing APIs with DBs is a good idea. I don't have much else to say.


It's certainly not a new idea, however. Every product that allows backdoor access into their database directly has effectively done this. Hope you have your schema correct when your customers build a mountain of cards on top of your database schema.


How does this differ from them building a mountain of cards on top of your API?

The schema used by the public API doesn't have to match the database layout itself. E.g. views can be used for this.

A lot of APIs are quite straightforward mappings, typically from REST/GraphQL to a database, often via some sort of ORM or other abstraction. These extra layers mostly amount to needless complexity, more surface area for bugs, lots of busywork and worse API usability.


It differs dramatically. Your API is likely much much smaller than the inherent API of SQL tables. It's a difference in degree that is a difference in kind.

Anyways, every old idea must be retried.


I don't see much difference between a set of SQL tables and a set of API endpoints that let you list a set of "objects". At least in read-only cases with no sensitive data.

Of course there's the big difference that with SQL you can compute and join tables server side, sparing a lot of traffic compared to doing this manually (or coming up with ad-hoc ways of doing it). It's not like the SQL itself is gonna substantially change.

The schema can be restricted or kept compatible with e.g. views as mentioned before (although I'd go more for the consenting-adults philosophy here too). Overly heavy queries can be limited with e.g. query timeouts, quotas or query priorities. And you usually have to do such things somehow anyway.

I have a hard time seeing major concrete problems with the approach that don't have existing straightforward solutions. I do see a lot of unfounded FUD.

For writes and sensitive data it gets trickier, but for many cases it would be probably easier to secure the database interface than to do it ad-hoc on a "business layer".

I haven't heard the idea of public-facing SQL querying floated around much in at least 20 years. When it's proposed, the response is usually a knee-jerk no, as seen in most comments in this post.

Also I'm not much of a fan of SQL itself but I'm against the current software development philosophy that computing should be made as restricted, layered and inflexible as possible.


"Skip the api" - seems to be stolen from sequin at: https://docs.sequin.io/integrations/airtable/playbooks/metab... (a few scrolls down)


Author here. I hadn't heard of Sequin until today. "Skip the API" was just a snappy title so I went with it.


I'm curious about the types of users this has. This feels like something that small to mid-size companies may leverage, but anything slightly more sophisticated would typically require access controls, merging data with other sources, etc. Doesn't feel like a very scalable approach.


Author here. We do this with some internal applications that share internal state -- no customer data. I would expect stricter restrictions depending on the type of data shared and how bureaucratic the company is.


While you're at it, skip the internet and burn the database on DVDs, then distribute them!


Even if you "skip the API" you still have to think about access patterns for the data, otherwise client-side query performance is going to be abysmal. You would basically have to define your "API" via what indexes you create on the data.
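e.g. (hypothetical schema) if you expect consumers to pull events by account and time range, the shipped file needs the matching index:

    CREATE INDEX idx_events_account_time
      ON events (account_id, created_at);

    -- the kind of client query this supports without a full scan
    SELECT * FROM events
    WHERE account_id = 42 AND created_at >= '2023-09-01'
    ORDER BY created_at;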


With a local SQLite database you don't need to pay the cost of network transfers.

A dozen local SQL calls could be way faster than a single API call within a region.


That's true, but low query latency doesn't mean much if you have to do a full table scan for your query. Some thought still needs to be put into how the clients will query the data, so it's really not as simple as just "ship your entire database to the client and let them figure it out".


Is this spiritually just like using a Redis cache?


> The typical approach to making data available between services is to spend weeks designing an API

What?

No.


Distrust is building for Fly. Their articles ooze arrogance, whether it's hyping up their "unique" hiring methods or their "managed" Postgres service. They even gloss over serious flaws like data loss in LiteFS. Their latest article is another red flag for me. Started off intriguing, but now, I wouldn't trust them as far as I can throw them.


Author here. My goal with the article was to write about a use of LiteFS that I found to be useful and show the benefits and trade-offs. I don't think it's a general purpose technique for everyone, but I think it has its place in some cases. What was it about the post that oozed arrogance?


Musing about APIs and the constant complaints about "abuse" and "bots", I think it would be an interesting experiment to set up an API where, when a user hits the daily limit, instead of just being "blocked" (which according to complaints does not really work due to the use of proxies), the user is sent a link to the database for download. The daily, updated database could be hosted on something like Backblaze. Would this satiate the demand for the data in a more efficient way than an "API"?


> Moving compute to the client

This is a great point and the reason why I think server-side rendering will ultimately fail.

I've done something similar for a site I manage (6groups.com). Each client's DB log is stored in compressed files on the server and only initialized via sqlite wasm on the client when needed. The trick is to periodically run compaction jobs (also on the client) to make sure DB logs don't grow to unreasonable sizes

I can't imagine the DB consumption I'd have if all client data was in relational format on the server


Isn't this effectively just letting all services access a central database? E.g. letting them all run SQL queries against Postgres?

Seems like overengineering to replicate the DB to each machine when we have Postgres with read replicas or dbs like cockroachdb.

Like why replicate your 10GB DB to all 30 of your machines when they can access a 3x3 DB cluster instead? Seems like it's asking for issues, as now your machines are no longer stateless and have to play catch up if they go down before they can serve queries.


I think I found a good use case for this.

I'm running a gateway that uses a SQLite database to index transaction data. One instance loads all TXs and indexes them into SQLite, and LiteFS replicates the database to the other instances.

Down-side: there are copies of the database on every machine.

Up-side: Saves work, since only one machine needs to index and reduces latency for reads, as every machine reads from a local SQLite replica.

Bonus: LiteFS Cloud keeps backups of the database.


I was really excited about fly.io at one point but after reading all the recent posts(1) on HN I don't have trust in their services anymore. Hope they are working on improving system stability first.

(1) https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...


A database isn't designed to be used this way. You'll eventually need it to do something "extra", and your company will probably find some workaround or hack, but eventually make an API.

APIs are an intermediate layer of abstraction needed because their purpose is separate from that of a DDL/DML/DCL/TCL. Having raw database access does not solve the use cases the database doesn't account for.


This is not equivalent to (say) a single Postgres database where we get shared state with transactions between all users - or have I missed something?


I've tried doing these reads & updates using PostgREST and row-level security via Supabase. When it works it's an amazing experience, but even for semi-complex stuff you would still need to use Postgres RPC, which is still another "API" layer. I find writing an API using SQL a nightmare.

Simple queries like this don't work on Postgrest: `update likes set likes = likes + 1;`
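The increment ends up wrapped in a function and called through PostgREST's `/rpc/` endpoint, roughly like this (sketch, hypothetical names):

    create function increment_likes(row_id bigint) returns void
      language sql as $$
        update likes set likes = likes + 1 where id = row_id;
      $$;

    -- then from the client: POST /rpc/increment_likes with body {"row_id": 123}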


I wonder why they chose to implement LiteFS as a FUSE module, instead of simply hacking SQLite to intercept the IO operations before they go through to the OS.

Yes, it would require a patch and a recompile, but performance would be superior by not having to go through the kernel and FUSE, and there would be fewer moving parts.


That's what LiteVFS (still in active development) does, except using a SQLite extension: https://github.com/superfly/litevfs


LiteFS works this way so that its users can use the unmodified SQLite libraries they'd normally use.


Surely you will not ever regret exposing all the internal representations of your data to your clients.


It depends who your clients are. If clients are public/external users, probably not.

If clients are internal services you own and control as part of an internal distributed system, it's a nice pattern


Even then it's often a choice you will regret down the line, when you wish to evolve away from it and it becomes a huge ordeal.


It seems like SQLite is now acting as the API gateway. I think it's an interesting idea, especially if you want to share relational data out to consumers, since they now have all the flexibility of SQL. The issue though is updating the queries when schema changes are made.


Even for some more advanced use cases such as OLAP or ML-related pipelines, in my experience it takes a single senior developer a couple of days to design and implement a REST API, not weeks. That claim is overblown FUD.


Exposing your schema directly is an anti-pattern for good reasons. Don't do this. You want to be able to change a contract/interface/API independently of your persistence strategy.


You might want to. You don't have to. It's a tradeoff, not a law.


I've seen teams take this literally and actually ship an Elastic instance in an iframe as a "product". Needless to say, users abandoned it in no time.


If I can't update the data and I need an API to do that, then it's a static site. I don't need a database in the first place.


> Moving compute to the client

I thought we were slowly learning that this was a stupid idea in the first place and moving back to the old ways (tm)?


This must be from a person who never had to deal with a system where the database actually was the API.


How does this approach compare with Directus, which auto-creates an API layer on top of the DB?


What happened to keeping things simple and scaling them?

Or did that train leave a long time ago?


I asked chatgpt for a solution looking for a problem and it sent me here


Skip the database. Pass a giant JSON document back and forth.


An XMLRPC by any other name would smell as rancid.


How would I serve terabytes of stuff?


Author here. It's not meant as a one-size-fits-all approach. It was an interesting side effect that we noticed and used internally, so I wanted to share my experience. The database mentioned in the post is about 8GB and replicating it works pretty well.


You wouldn't use this if that was a requirement


BS. API layers exist for a reason. Maybe they will realize that reason if they try to expose the DB...


I used to be a big fan of fly.io until I faced a lot of deployment issues. Oftentimes the server would go down for a few minutes and it wasn't even updated on the status page. Is this still the case, or have things gotten better?



